EXPLANATORY DATA ANALYSIS

Data Analyst Nanodegree, Udacity | vamshi.krishna.prime@gmail.com


Table of contents:

1. Import libraries

2. Load Data

3. Explanatory Data Analysis

4. Investigation Summary

5. Credits


Explore Bikeshare Data and communicate data findings

In [86]:
from IPython.display import Image
Image("img/Metro Bikeshare.jpg")
Out[86]:

Image description: image of Metro Bike bikeshare.


Investigation Overview:

The investigation focuses on the factors that influence bike rentals and on the changes that could improve rentals, based on customer preferences and hidden trends in the data.

Dataset Overview:

The bikeshare data covers 3 years and includes bike type, ride type, customer pass type, ride timeline and distance, along with geographical data. Other variables such as fare, distance_miles and fare type are feature engineered for deeper analysis. The data needed wrangling and cleaning operations, which were completed in ACT 1 of the process. The data is stored in a relational database, split into individual tables as is common practice for storing large datasets.


1. Import libraries

===========================

In [87]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sb
from sqlalchemy import create_engine
%matplotlib inline
from matplotlib.lines import Line2D
import matplotlib.patches as patches

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

2. Load Data

=================

  • Available as Flat File:
  • Available as Database:
Dataset          Available format      Description                    Mode of access
bikeshare_clean  bikeshare_master.csv  A clean dataset in CSV format  Load directly using the read_csv method in pandas
bikeshare_clean  bikeshare_master.db   A relational database          Requires an SQL query to gather data

2.1 Load Data using `SQL Query`:

ETL Table for the Bikeshare relational database

Dataset   bike             time            fare       station
variable  trip_id          trip_id         trip_id    trip_id
variable  bike_id          start_time      trip_id    start_station_id
variable  trip_type        end_time        fare       start_lat
variable  bike_type        duration        fare_type  start_lon
variable  passholder_type  distance_miles             end_station_id
variable                                              end_lat
variable                                              end_lon
In [88]:
engine = create_engine('sqlite:///bikeshare_master.db')
In [89]:
# Import data from the database into a dataframe using SQL query
bikeshare = pd.read_sql('SELECT b.trip_id, \
                                b.bike_id, \
                                b.trip_type, \
                                b.bike_type, \
                                b.passholder_type AS pass_type, \
                                f.fare_type, \
                                t.start_time, \
                                t.end_time, \
                                t.duration AS duration_min, \
                                t.distance_miles, \
                                f.fare, \
                                s.start_station_id, \
                                s.start_lat, \
                                s.start_lon, \
                                s.end_station_id, \
                                s.end_lat, \
                                s.end_lon \
                           FROM bike AS b \
                           JOIN time AS t \
                             ON b.trip_id = t.trip_id \
                           JOIN fare AS f \
                             ON b.trip_id = f.trip_id \
                           JOIN station AS s \
                             ON t.trip_id = s.trip_id', engine)

An alternate approach is to load the data from the flat file in CSV format.

# optional to execute: an alternate approach to load data
bikeshare = pd.read_csv('bikeshare_master.csv', sep=',', low_memory=False)
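
The query-based load can be tried out on a toy database before running it against the full file. The sketch below is only an illustration under stated assumptions: the two tiny tables and their rows are made up, and an in-memory SQLite connection stands in for bikeshare_master.db. It also shows that `pd.read_sql` accepts a `parse_dates` argument, which converts timestamp columns while loading.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for bikeshare_master.db (assumption).
con = sqlite3.connect(":memory:")

# Hypothetical miniature versions of the 'bike' and 'time' tables.
pd.DataFrame({"trip_id": [1, 2],
              "trip_type": ["One Way", "Round Trip"]}).to_sql("bike", con, index=False)
pd.DataFrame({"trip_id": [1, 2],
              "start_time": ["2017-01-01 00:15:00", "2017-01-01 00:24:00"],
              "duration": [12, 30]}).to_sql("time", con, index=False)

# Same join pattern as the notebook query, written as a triple-quoted string.
query = """
SELECT b.trip_id,
       b.trip_type,
       t.start_time,
       t.duration AS duration_min
  FROM bike AS b
  JOIN time AS t
    ON b.trip_id = t.trip_id
"""

# parse_dates converts the listed columns to datetime64 during the load.
trips = pd.read_sql(query, con, parse_dates=["start_time"])
con.close()

print(trips.dtypes)
```

A triple-quoted query avoids the trailing backslashes, at the cost of embedding newlines in the SQL string, which SQLite ignores.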

2.2 Restore Dataset properties:

In [90]:
bikeshare.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808589 entries, 0 to 808588
Data columns (total 17 columns):
trip_id             808589 non-null int64
bike_id             808589 non-null int64
trip_type           808589 non-null object
bike_type           808589 non-null object
pass_type           808589 non-null object
fare_type           808589 non-null object
start_time          808589 non-null object
end_time            808589 non-null object
duration_min        808589 non-null int64
distance_miles      808589 non-null float64
fare                808589 non-null float64
start_station_id    808589 non-null int64
start_lat           808589 non-null float64
start_lon           808589 non-null float64
end_station_id      808589 non-null int64
end_lat             808589 non-null float64
end_lon             808589 non-null float64
dtypes: float64(6), int64(5), object(6)
memory usage: 104.9+ MB

Not all columns retain their datatype information when the dataset is retrieved from the database, because type information is lost when data moves from one format/platform to another. The incorrect column datatypes have to be assigned manually.

In [91]:
level_order = ['One Way', 'Round Trip']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['trip_type'] = bikeshare['trip_type'].astype(ordered_cat)

level_order = ['unknown', 'Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['bike_type'] = bikeshare['bike_type'].astype(ordered_cat)

level_order = ['Walk-up', 'One Day', 'Monthly', 'Flex', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['pass_type'] = bikeshare['pass_type'].astype(ordered_cat)

level_order = ['Base', 'Extended']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['fare_type'] = bikeshare['fare_type'].astype(ordered_cat)

bikeshare['start_time'] = pd.to_datetime(bikeshare['start_time'])
bikeshare['end_time'] = pd.to_datetime(bikeshare['end_time'])
In [92]:
bikeshare.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808589 entries, 0 to 808588
Data columns (total 17 columns):
trip_id             808589 non-null int64
bike_id             808589 non-null int64
trip_type           808589 non-null category
bike_type           808589 non-null category
pass_type           808589 non-null category
fare_type           808589 non-null category
start_time          808589 non-null datetime64[ns]
end_time            808589 non-null datetime64[ns]
duration_min        808589 non-null int64
distance_miles      808589 non-null float64
fare                808589 non-null float64
start_station_id    808589 non-null int64
start_lat           808589 non-null float64
start_lon           808589 non-null float64
end_station_id      808589 non-null int64
end_lat             808589 non-null float64
end_lon             808589 non-null float64
dtypes: category(4), datetime64[ns](2), float64(6), int64(5)
memory usage: 83.3 MB
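
The repeated level_order / ordered_cat / astype pattern used to restore the categorical columns can also be driven by a single mapping. This is a minimal sketch on a made-up three-row frame, not the notebook's own code; the `sample` frame and the `category_levels` dictionary are hypothetical names, while the category levels are copied from the cell above.

```python
import pandas as pd

# Hypothetical miniature of the frame returned by pd.read_sql:
# categorical and datetime columns arrive as plain strings.
sample = pd.DataFrame({
    "trip_type": ["One Way", "Round Trip", "One Way"],
    "fare_type": ["Base", "Extended", "Base"],
    "start_time": ["2017-01-01 00:15:00", "2017-01-01 00:24:00", "2017-01-01 00:28:00"],
})

# Column name -> ordered category levels.
category_levels = {
    "trip_type": ["One Way", "Round Trip"],
    "fare_type": ["Base", "Extended"],
}

# One loop replaces the copy-pasted blocks, one per column.
for column, levels in category_levels.items():
    dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)
    sample[column] = sample[column].astype(dtype)

sample["start_time"] = pd.to_datetime(sample["start_time"])

print(sample.dtypes)
```

Adding another categorical column then only requires one more dictionary entry.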

2.3 Feature Engineering:

Expand the dataset by extracting timeline variables for further plotting

The time series data related to rentals hour/day/week/month/year needs to be prepared/extracted for further plotting.

In [93]:
%%time
# create timeline variables from the existing data
bikeshare['year'] = bikeshare['start_time'].dt.year
bikeshare['month'] = bikeshare['start_time'].dt.month
bikeshare['weekday'] = bikeshare['start_time'].dt.weekday
bikeshare['day'] = bikeshare['start_time'].dt.day
bikeshare['hour'] = bikeshare['start_time'].dt.hour

bikeshare[['year', 'month', 'weekday', 'day', 'hour']].head()
Wall time: 2.14 s
Out[93]:
year month weekday day hour
0 2017 1 6 1 0
1 2017 1 6 1 0
2 2017 1 6 1 0
3 2017 1 6 1 0
4 2017 1 6 1 0

Extract daytime from the hour column:

Extract day_section from hour column.

In [94]:
# divide the hour of the day into customized sections
bins = [-1,5,11,16,20,23]  # named 'bins' to avoid shadowing the built-in bin()
bikeshare['day_sections'] = pd.cut(bikeshare['start_time'].dt.hour,bins)
bikeshare['day_sections'].head(10)
Out[94]:
0    (-1, 5]
1    (-1, 5]
2    (-1, 5]
3    (-1, 5]
4    (-1, 5]
5    (-1, 5]
6    (-1, 5]
7    (-1, 5]
8    (-1, 5]
9    (-1, 5]
Name: day_sections, dtype: category
Categories (5, interval[int64]): [(-1, 5] < (5, 11] < (11, 16] < (16, 20] < (20, 23]]

Explore various methods for extracting the section of the day from the hour of the day. To find the best-performing method (the one that takes the least time to extract the values), take the first 1000 entries in the dataset and measure the execution time of each.

In [95]:
%%capture --no-stdout


def apply_section(row):
    if row in df_new.day_sections.unique()[0] :
        return 'Early hours'
    if row in df_new.day_sections.unique()[1] :
        return 'Morning'
    if row in df_new.day_sections.unique()[2] :
        return 'Afternoon'
    if row in df_new.day_sections.unique()[3] :
        return 'Evening'
    if row in df_new.day_sections.unique()[4] :
        return 'Night'
    return 'unknown'


def map_identity(row):
    if row in df_new.day_sections.unique()[0] :
        return 'Early hours'
    if row in df_new.day_sections.unique()[1] :
        return 'Morning'
    if row in df_new.day_sections.unique()[2] :
        return 'Afternoon'
    if row in df_new.day_sections.unique()[3] :
        return 'Evening'
    if row in df_new.day_sections.unique()[4] :
        return 'Night'
    return 'unknown'


def map_identity2(row):
    if row == df_new.day_sections.unique()[0] :
        return 'Early hours'
    if row == df_new.day_sections.unique()[1] :
        return 'Morning'
    if row == df_new.day_sections.unique()[2] :
        return 'Afternoon'
    if row == df_new.day_sections.unique()[3] :
        return 'Evening'
    if row == df_new.day_sections.unique()[4] :
        return 'Night'
    return 'unknown'


def mask_section(df):
    # start from the interval values, then chain each replacement on 'label4' so earlier labels are kept
    df['label4'] = df['day_sections'].astype(object).mask(df.day_sections==df.day_sections.unique()[0], 'Early hours')
    df['label4'] = df['label4'].mask(df.day_sections==df.day_sections.unique()[1], 'Morning')
    df['label4'] = df['label4'].mask(df.day_sections==df.day_sections.unique()[2], 'Afternoon')
    df['label4'] = df['label4'].mask(df.day_sections==df.day_sections.unique()[3], 'Evening')
    df['label4'] = df['label4'].mask(df.day_sections==df.day_sections.unique()[4], 'Night')


def npwhere_section(df):
    df['label5'] = np.where(df.day_sections == df.day_sections.unique()[0], 'Early hours', df.day_sections)
    df['label5'] = np.where(df.day_sections == df.day_sections.unique()[1], 'Morning', df.label5)
    df['label5'] = np.where(df.day_sections == df.day_sections.unique()[2], 'Afternoon', df.label5)
    df['label5'] = np.where(df.day_sections == df.day_sections.unique()[3], 'Evening', df.label5)
    df['label5'] = np.where(df.day_sections == df.day_sections.unique()[4], 'Night', df.label5)


def loc_section(df):
    df.loc[df['day_sections'] == df.day_sections.unique()[0],'label6'] = 'Early hours'
    df.loc[df['day_sections'] == df.day_sections.unique()[1],'label6'] = 'Morning'
    df.loc[df['day_sections'] == df.day_sections.unique()[2],'label6'] = 'Afternoon'
    df.loc[df['day_sections'] == df.day_sections.unique()[3],'label6'] = 'Evening'
    df.loc[df['day_sections'] == df.day_sections.unique()[4],'label6'] = 'Night'



df_new = bikeshare.head(1000).copy()

%time df_new['label1'] = df_new['hour'].apply(lambda row: apply_section(row))
%time df_new['label2'] = df_new['hour'].map(map_identity)
%time df_new['label3'] = df_new['day_sections'].map(map_identity2)
%time mask_section(df_new)
%time npwhere_section(df_new)
%time loc_section(df_new)
Wall time: 3.33 s
Wall time: 3.32 s
Wall time: 19.6 ms
Wall time: 245 ms
Wall time: 20 ms
Wall time: 161 ms

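From the timings above, map applied to the day_sections column (19.6 ms) and np.where (20 ms) are the fastest on this 1000-row sample, followed by the .loc method (161 ms) and mask (245 ms). The two row-wise functions applied to the hour column take over 3 seconds each. However, on larger datasets the .loc method performs better.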

In [96]:
from IPython.display import Image
Image("img/performance chart.PNG", width = 600, height = 300)
Out[96]:

It can be determined from the above steps that the .loc method is the best solution for adding a new column by extracting/comparing values from an existing column.
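
The gap between row-wise and whole-column labelling can be reproduced on synthetic data. The following sketch is an illustration only: the random `hours` series, its size, and the two helper functions are assumptions, and `np.select` is used here as a multi-branch relative of the `np.where` calls timed above. Absolute timings will differ from machine to machine.

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'hour' column; the size is arbitrary.
rng = np.random.default_rng(0)
hours = pd.Series(rng.integers(0, 24, size=100_000), name="hour")

upper_edges = [5, 11, 16, 20, 23]
names = ["Early hours", "Morning", "Afternoon", "Evening", "Night"]

def label_rowwise(hour):
    # Python-level function called once per row (like the apply/map variants).
    for upper, name in zip(upper_edges, names):
        if hour <= upper:
            return name
    return "unknown"

def label_vectorized(series):
    # Whole-column comparisons; np.select picks the first condition that holds.
    values = series.to_numpy()
    conditions = [values <= upper for upper in upper_edges]
    return pd.Series(np.select(conditions, names, default="unknown"),
                     index=series.index)

start = time.perf_counter()
rowwise = hours.map(label_rowwise)
rowwise_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized = label_vectorized(hours)
vectorized_seconds = time.perf_counter() - start

print(f"row-wise: {rowwise_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")
```

Both approaches produce the same labels, so the choice between them is purely a matter of speed and readability.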

Extract daytime from day_section.

In [97]:
%%time

def assign_daytime(df):
    df.loc[df['day_sections'] == df.day_sections.unique()[0],'daytime'] = 'Early hours'
    df.loc[df['day_sections'] == df.day_sections.unique()[1],'daytime'] = 'Morning'
    df.loc[df['day_sections'] == df.day_sections.unique()[2],'daytime'] = 'Afternoon'
    df.loc[df['day_sections'] == df.day_sections.unique()[3],'daytime'] = 'Evening'
    df.loc[df['day_sections'] == df.day_sections.unique()[4],'daytime'] = 'Night'
    

assign_daytime(bikeshare)
bikeshare.daytime.value_counts()
Wall time: 1.09 s
Out[97]:
Afternoon      291403
Evening        222231
Morning        210297
Night           59605
Early hours     25053
Name: daytime, dtype: int64

As estimated, the .loc method exhibited good performance, extracting the daytime values from the day_sections column with 808589 entries in about 1 second.

In [98]:
# display a sample of 'daytime' entries for visual confirmation
bikeshare[['day_sections', 'daytime']].sample(10)
Out[98]:
day_sections daytime
251507 (11, 16] Afternoon
645361 (5, 11] Morning
589732 (5, 11] Morning
128620 (16, 20] Evening
316445 (16, 20] Evening
517045 (5, 11] Morning
679170 (11, 16] Afternoon
94031 (16, 20] Evening
106450 (11, 16] Afternoon
54263 (5, 11] Morning
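
As an alternative sketch (not the approach taken in this notebook), `pd.cut` can attach the section names directly through its `labels` argument, which yields an ordered categorical in one step. This avoids depending on the order of `unique()`, which returns values in order of first appearance rather than in category order. The small `hours` series below is made up for illustration.

```python
import pandas as pd

# Made-up hours that cover both edges of every bin.
hours = pd.Series([0, 5, 6, 11, 12, 16, 17, 20, 21, 23], name="hour")

bins = [-1, 5, 11, 16, 20, 23]
labels = ["Early hours", "Morning", "Afternoon", "Evening", "Night"]

# With labels supplied, pd.cut returns the named sections as an ordered categorical.
daytime = pd.cut(hours, bins=bins, labels=labels)

print(daytime.tolist())
print(daytime.cat.ordered)
```

Because the result is already an ordered categorical, a later astype conversion of the daytime column would not be needed with this variant.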

Change weekday representation:

Change the weekday representation from numeric values to descriptive values. As before, use the .loc method to derive the new values from the existing column values.

Integer Value  Day of the week
0              Monday
1              Tuesday
2              Wednesday
3              Thursday
4              Friday
5              Saturday
6              Sunday
In [99]:
%%time

def assign_weekday(df):
    df.loc[df['weekday'] == 0,'weekday'] = 'Monday'
    df.loc[df['weekday'] == 1,'weekday'] = 'Tuesday'
    df.loc[df['weekday'] == 2,'weekday'] = 'Wednesday'
    df.loc[df['weekday'] == 3,'weekday'] = 'Thursday'
    df.loc[df['weekday'] == 4,'weekday'] = 'Friday'
    df.loc[df['weekday'] == 5,'weekday'] = 'Saturday'
    df.loc[df['weekday'] == 6,'weekday'] = 'Sunday'
    

assign_weekday(bikeshare)

# display a sample of 'weekday' entries for visual confirmation
bikeshare[['weekday']].sample(10)
Wall time: 1.23 s
Out[99]:
weekday
600482 Monday
710044 Wednesday
22061 Tuesday
262967 Thursday
532167 Tuesday
112713 Wednesday
272723 Friday
348608 Monday
509711 Wednesday
286609 Wednesday
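
The same integer-to-name conversion is also available through the pandas datetime accessor. The sketch below is an alternative illustration rather than the notebook's method; the three timestamps are made up, and `dt.day_name()` returns English names under the default locale.

```python
import pandas as pd

# Made-up start times: a Sunday, a Monday and a Friday in January 2017.
start_time = pd.Series(pd.to_datetime(["2017-01-01 00:15",
                                       "2017-01-02 08:30",
                                       "2017-01-06 17:45"]))

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
weekday_dtype = pd.api.types.CategoricalDtype(categories=weekday_order, ordered=True)

# day_name() gives the descriptive name; astype applies the Monday-first ordering.
weekday = start_time.dt.day_name().astype(weekday_dtype)

print(weekday.tolist())
```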

Extract the relative number of the week in a month:

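Each month has four full 7-day blocks plus up to three remaining days (the 29th to the 31st), depending on the month and, for February, on the leap year. Extract the relative number of the week within each month, with the remaining days forming a partial fifth week.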

In [100]:
bins = [0,7,14,21,28,31]  # named 'bins' to avoid shadowing the built-in bin()
# use the pd.cut function to assign the values to their specific bins
bikeshare['week_sections'] = pd.cut(bikeshare['day'],bins)
bikeshare[['week_sections']].head()
Out[100]:
week_sections
0 (0, 7]
1 (0, 7]
2 (0, 7]
3 (0, 7]
4 (0, 7]
In [101]:
bikeshare.week_sections.unique()
Out[101]:
[(0, 7], (7, 14], (14, 21], (21, 28], (28, 31]]
Categories (5, interval[int64]): [(0, 7] < (7, 14] < (14, 21] < (21, 28] < (28, 31]]
In [102]:
%%time

def assign_week(df):
    df.loc[df['week_sections'] == df.week_sections.unique()[0],'week'] = 'First'
    df.loc[df['week_sections'] == df.week_sections.unique()[1],'week'] = 'Second'
    df.loc[df['week_sections'] == df.week_sections.unique()[2],'week'] = 'Third'
    df.loc[df['week_sections'] == df.week_sections.unique()[3],'week'] = 'Fourth'
    df.loc[df['week_sections'] == df.week_sections.unique()[4],'week'] = 'Fifth'
    

assign_week(bikeshare)
bikeshare.week.value_counts()
Wall time: 1.18 s
Out[102]:
Third     188745
Fourth    185200
Second    184644
First     183795
Fifth      66205
Name: week, dtype: int64
In [103]:
bikeshare[['week_sections', 'week']].sample(10)
Out[103]:
week_sections week
366992 (0, 7] First
577509 (14, 21] Third
578452 (14, 21] Third
53929 (21, 28] Fourth
373026 (7, 14] Second
266536 (21, 28] Fourth
463274 (0, 7] First
477520 (14, 21] Third
477721 (14, 21] Third
721541 (14, 21] Third
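
The binning into 7-day blocks is equivalent to a small piece of integer arithmetic on the day of the month: week = (day - 1) // 7 + 1. A minimal sketch on made-up day values (the `week_names` mapping is a hypothetical helper):

```python
import pandas as pd

# Made-up days of the month covering both edges of every block.
day = pd.Series([1, 7, 8, 14, 15, 21, 22, 28, 29, 31], name="day")

# Days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-28 -> 4, 29-31 -> 5.
week_number = (day - 1) // 7 + 1

week_names = {1: "First", 2: "Second", 3: "Third", 4: "Fourth", 5: "Fifth"}
week = week_number.map(week_names)

print(week.tolist())
```

The "Fifth" week covers at most three days, which is consistent with its much lower count in the output above.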

Extract quarter of the year from the month column:

Extract quarter_sections from month column.

In [104]:
# divide the months of the year into quarters
bins = [0,3,6,9,12]
# use the pd.cut function to assign the values to their specific bins
bikeshare['quarter_sections'] = pd.cut(bikeshare['start_time'].dt.month,bins)
bikeshare['quarter_sections'].sample(10)
Out[104]:
558947     (0, 3]
420277     (6, 9]
713596     (6, 9]
437319     (6, 9]
568048     (0, 3]
536975     (0, 3]
76149      (3, 6]
99088      (6, 9]
797956    (9, 12]
556634     (0, 3]
Name: quarter_sections, dtype: category
Categories (4, interval[int64]): [(0, 3] < (3, 6] < (6, 9] < (9, 12]]

Extract quarter from quarter_sections.

In [105]:
bikeshare.quarter_sections.unique()
Out[105]:
[(0, 3], (3, 6], (6, 9], (9, 12]]
Categories (4, interval[int64]): [(0, 3] < (3, 6] < (6, 9] < (9, 12]]
In [106]:
%%time

def extract_quarter(df):
    df.loc[df['quarter_sections'] == df.quarter_sections.unique()[0],'quarter'] = 'Q1'
    df.loc[df['quarter_sections'] == df.quarter_sections.unique()[1],'quarter'] = 'Q2'
    df.loc[df['quarter_sections'] == df.quarter_sections.unique()[2],'quarter'] = 'Q3'
    df.loc[df['quarter_sections'] == df.quarter_sections.unique()[3],'quarter'] = 'Q4'


extract_quarter(bikeshare)
bikeshare.quarter.value_counts()
Wall time: 1.06 s
Out[106]:
Q3    251474
Q4    215317
Q2    188588
Q1    153210
Name: quarter, dtype: int64

As estimated, the .loc method exhibited good performance, extracting the quarter of the year values from the quarter_sections column with 808589 entries in about 1 second.

In [107]:
# display a sample of 'quarter' entries for visual confirmation
bikeshare[['quarter_sections', 'quarter']].sample(10)
Out[107]:
quarter_sections quarter
134970 (6, 9] Q3
612582 (3, 6] Q2
294403 (3, 6] Q2
279720 (0, 3] Q1
275427 (0, 3] Q1
628039 (3, 6] Q2
294233 (3, 6] Q2
223324 (9, 12] Q4
140058 (6, 9] Q3
694022 (6, 9] Q3
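
For calendar quarters specifically, the pandas datetime accessor exposes the quarter number directly, so the binning step can be skipped. This is an alternative sketch on made-up dates, not the notebook's own code.

```python
import pandas as pd

# Made-up dates, one per calendar quarter.
start_time = pd.Series(pd.to_datetime(["2017-02-14", "2017-06-30",
                                       "2018-07-01", "2019-12-31"]))

# dt.quarter returns 1-4; prefix with 'Q' to match the labels used above.
quarter = "Q" + start_time.dt.quarter.astype(str)

print(quarter.tolist())
```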

Change datatypes of multiple columns to ordered categorical dtype:

In [108]:
bikeshare.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808589 entries, 0 to 808588
Data columns (total 28 columns):
trip_id             808589 non-null int64
bike_id             808589 non-null int64
trip_type           808589 non-null category
bike_type           808589 non-null category
pass_type           808589 non-null category
fare_type           808589 non-null category
start_time          808589 non-null datetime64[ns]
end_time            808589 non-null datetime64[ns]
duration_min        808589 non-null int64
distance_miles      808589 non-null float64
fare                808589 non-null float64
start_station_id    808589 non-null int64
start_lat           808589 non-null float64
start_lon           808589 non-null float64
end_station_id      808589 non-null int64
end_lat             808589 non-null float64
end_lon             808589 non-null float64
year                808589 non-null int64
month               808589 non-null int64
weekday             808589 non-null object
day                 808589 non-null int64
hour                808589 non-null int64
day_sections        808589 non-null category
daytime             808589 non-null object
week_sections       808589 non-null category
week                808589 non-null object
quarter_sections    808589 non-null category
quarter             808589 non-null object
dtypes: category(7), datetime64[ns](2), float64(6), int64(9), object(4)
memory usage: 134.9+ MB
In [109]:
df = bikeshare

level_order = ['Early hours', 'Morning', 'Afternoon', 'Evening', 'Night']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['daytime'] = df['daytime'].astype(ordered_cat)

level_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['weekday'] = df['weekday'].astype(ordered_cat)

level_order = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['week'] = df['week'].astype(ordered_cat)

level_order = ['Q1', 'Q2', 'Q3', 'Q4']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['quarter'] = df['quarter'].astype(ordered_cat)
In [110]:
bikeshare.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808589 entries, 0 to 808588
Data columns (total 28 columns):
trip_id             808589 non-null int64
bike_id             808589 non-null int64
trip_type           808589 non-null category
bike_type           808589 non-null category
pass_type           808589 non-null category
fare_type           808589 non-null category
start_time          808589 non-null datetime64[ns]
end_time            808589 non-null datetime64[ns]
duration_min        808589 non-null int64
distance_miles      808589 non-null float64
fare                808589 non-null float64
start_station_id    808589 non-null int64
start_lat           808589 non-null float64
start_lon           808589 non-null float64
end_station_id      808589 non-null int64
end_lat             808589 non-null float64
end_lon             808589 non-null float64
year                808589 non-null int64
month               808589 non-null int64
weekday             808589 non-null category
day                 808589 non-null int64
hour                808589 non-null int64
day_sections        808589 non-null category
daytime             808589 non-null category
week_sections       808589 non-null category
week                808589 non-null category
quarter_sections    808589 non-null category
quarter             808589 non-null category
dtypes: category(11), datetime64[ns](2), float64(6), int64(9)
memory usage: 113.4 MB
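
The drop in reported memory after the conversion comes from how categoricals are stored: each row holds a small integer code instead of a full Python string. The sketch below measures this on a synthetic weekday column; the data and its size are assumptions for illustration, and exact byte counts depend on the pandas version.

```python
import numpy as np
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Synthetic column of weekday names stored as Python strings (object dtype).
rng = np.random.default_rng(0)
as_object = pd.Series(rng.choice(weekday_order, size=100_000)).astype(object)

weekday_dtype = pd.api.types.CategoricalDtype(categories=weekday_order, ordered=True)
as_category = as_object.astype(weekday_dtype)

# deep=True counts the string payloads, not only the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object: {object_bytes:,} bytes, category: {category_bytes:,} bytes")
```

The "+" in the earlier info() output (134.9+ MB) indicates that the object columns were not measured deeply, so the real saving is likely larger than the two figures suggest.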

Remove redundant columns in the dataset:

In [111]:
cols_to_drop = ['day_sections', 'week_sections', 'quarter_sections']

bikeshare.drop(cols_to_drop, axis=1, inplace=True)
In [112]:
for i, col in enumerate(bikeshare.columns):
    print('{}'.format(i).ljust(2, " ") + ':' + '{}'.format(col))
0 :trip_id
1 :bike_id
2 :trip_type
3 :bike_type
4 :pass_type
5 :fare_type
6 :start_time
7 :end_time
8 :duration_min
9 :distance_miles
10:fare
11:start_station_id
12:start_lat
13:start_lon
14:end_station_id
15:end_lat
16:end_lon
17:year
18:month
19:weekday
20:day
21:hour
22:daytime
23:week
24:quarter

Reorder columns in the dataset:

Reorder the columns so that the most relevant/numerical data is placed leftmost for visual analysis.

In [113]:
reordered_columns = ['trip_id', 'bike_id', 'distance_miles', 'duration_min', 'fare',
                     'trip_type', 'bike_type', 'pass_type', 'fare_type', 'start_time', 
                     'year', 'quarter', 'month', 'week', 'weekday', 'day', 'daytime','hour',
                     'end_time', 'start_station_id', 'start_lat', 'start_lon', 
                     'end_station_id', 'end_lat', 'end_lon']

bikeshare = bikeshare.reindex(columns=reordered_columns)
In [114]:
for i, col in enumerate(bikeshare.columns):
    print('{}'.format(i).ljust(2, " ") + ':' + ' {}'.format(col))
0 : trip_id
1 : bike_id
2 : distance_miles
3 : duration_min
4 : fare
5 : trip_type
6 : bike_type
7 : pass_type
8 : fare_type
9 : start_time
10: year
11: quarter
12: month
13: week
14: weekday
15: day
16: daytime
17: hour
18: end_time
19: start_station_id
20: start_lat
21: start_lon
22: end_station_id
23: end_lat
24: end_lon

2.4 Set ColorBlind Palette:

In [115]:
# display current palette
current_palette = sb.color_palette()
sb.palplot(current_palette)
plt.show()
In [116]:
# set the palette to support 'colorblind'
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
current_palette = sb.color_palette()
sb.palplot(current_palette)
plt.show()
In [117]:
# visually confirm the palette change
current_palette = sb.color_palette()
sb.palplot(current_palette)
plt.show()

3. Explanatory Data Analysis

=========================================


3.1 Do customers prefer one way trips compared to round trips?

  • Column: trip_type
  • Data type: categorical data, nominal
  • Plot : Bar chart, Point plot, Facet grid

3.1.1 Aggregated distribution of bike rentals based on trip type:

In [118]:
# Assign color palette as per requirement
sb.set_style("white")
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
current_palette = sb.color_palette()
base_color = sb.color_palette()[0]

# prepare data for the plot
trip_type_order = bikeshare.trip_type.value_counts().index
max_count = bikeshare['trip_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]

# Seaborn's countplot
sb.countplot(data = bikeshare, x = 'trip_type', color = base_color, alpha= 0.5, order = trip_type_order)

# improve plot aesthetics
plt.title('Aggregated distribution of bike rentals based on trip type\n', fontsize = 16, weight='bold')
plt.xlabel('\nTrip type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
trip_type_counts = bikeshare['trip_type'].value_counts()

# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = trip_type_counts[label.get_text()]

    except KeyError:
        count = 0

    pct_string = '{:0.0f}%'.format(100*count/n_points)
    
    # print the annotation depending on the bar length
    if count < (n_points/10):
        plt.text(loc, count + (n_points/20), pct_string, ha = 'center', color = 'black', fontsize = 14)
    else:
        plt.text(loc, count - (n_points/10), pct_string, ha = 'center', color = 'black', fontsize = 14);
# -------------------------------------------------------

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.1 Aggregated distribution of bike rentals based on trip type.png', dpi=300, bbox_inches='tight')

The above plot shows that customers prefer One Way trips to Round Trips for bike rental. However, the plot is based on the overall sum of bike rentals and does not portray any trends/patterns that influence the trip type of the bike rentals over time. Hence, let us calculate the average bike rentals distributed over the hour of the day, categorized by trip type.

3.1.2 Average rentals based on the hour of the day over trip type:

In [119]:
# create a dataset for bike rentals over the hour of the day
hours_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["day"],
                              bikeshare["hour"],
                              bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')

hours_df.head(10)
Out[119]:
year month day hour trip_type rentals
0 2017 1 1 0 One Way 6.0
1 2017 1 1 0 Round Trip 3.0
2 2017 1 1 1 One Way 5.0
3 2017 1 1 1 Round Trip NaN
4 2017 1 1 2 One Way 8.0
5 2017 1 1 2 Round Trip NaN
6 2017 1 1 3 One Way 2.0
7 2017 1 1 3 Round Trip NaN
8 2017 1 1 4 One Way 1.0
9 2017 1 1 4 Round Trip NaN
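
The hourly average is therefore a two-step aggregation: first count rentals per calendar hour and trip type, then average those counts per hour of the day. The sketch below walks through both steps on six made-up trips; the `trips` frame and its `date` column are assumptions for illustration.

```python
import pandas as pd

# Six made-up trips spread over two days.
trips = pd.DataFrame({
    "date": ["2017-01-01"] * 4 + ["2017-01-02"] * 2,
    "hour": [8, 8, 8, 9, 8, 9],
    "trip_type": ["One Way", "One Way", "Round Trip",
                  "One Way", "One Way", "One Way"],
})

# Step 1: number of rentals per (date, hour, trip_type).
rentals = (trips.groupby(["date", "hour", "trip_type"])
                .size()
                .reset_index(name="rentals"))

# Step 2: average of those counts per hour of the day and trip type.
hourly_avg = rentals.groupby(["hour", "trip_type"])["rentals"].mean()

print(hourly_avg)
```

In this sketch, combinations with no rentals are simply absent, so they do not pull the average down. The NaN rows in the notebook output above are presumably skipped in the same way when the mean is taken, which means each average describes only the hours in which at least one rental of that type occurred.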

Point plot:

In [120]:
# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ["#fff480"]
sb.set_palette(flatui, n_colors=1, desat=0.8)
base_color = sb.color_palette()[0]

# Seaborn's pointplot
ax = sb.pointplot(data = hours_df, x = "hour", y = "rentals", linestyles = "--", hue = 'trip_type')

# improve plot aesthetics
# -------------------------------------------------------
plt.title('Hourly average bike rentals categorized by trip type\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nHour of the day', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
hours_rental_avg_max = 70
y_tick_values = np.arange(0, hours_rental_avg_max+10, 10)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, 
           framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# plot custom axial grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.2)

# add ellipse
# -------------------------------------------------------
ax.add_patch(
    patches.Ellipse(
        (2.5, 3), # (x,y)
        6, # width
        12, # height
        angle=5, # rotation angle in degrees
        alpha=0.2, facecolor="grey", edgecolor="lightgrey", linewidth=1, linestyle='solid'
    )
)
# -------------------------------------------------------
    
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.2 Hourly average bike rentals based on the trip type.png', dpi=300, bbox_inches='tight')

The above plot shows that the average number of One Way trips is higher than that of Round Trips at any hour between 6:00 AM and 12:00 AM. There is also a grey area (the shaded ellipse in the early hours) where the average number of bike rentals is very low and not statistically significant for comparison.

The above plot is calculated over all 3 years of data combined. Let us look at the individual years to check whether the same trend holds in each year.

3.1.3 Average rentals based on the hour of the day by trip type over years:

In [121]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
     # get x labels
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        # skip labels
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]

plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.1, 1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.3 Distribution of bike rentals based on trip type.png', dpi=300, bbox_inches='tight')

Observation:

One Way trips continue to dominate Round Trips at every hour of the day in each individual year.

Let us take a look at the other factors that influence the bike trips over time:

3.1.4 Average bike rentals based on day of the week over years by trip type:

In [122]:
# create a dataset for bike rentals over each day in a week
weekday_df = bikeshare.groupby([bikeshare["year"], 
                                bikeshare["month"],
                                bikeshare["week"],
                                bikeshare["weekday"],
                                bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')
weekday_df.head(10)
Out[122]:
year month week weekday trip_type rentals
0 2017 1 First Monday One Way 228.0
1 2017 1 First Monday Round Trip 31.0
2 2017 1 First Tuesday One Way 288.0
3 2017 1 First Tuesday Round Trip 39.0
4 2017 1 First Wednesday One Way 325.0
5 2017 1 First Wednesday Round Trip 25.0
6 2017 1 First Thursday One Way 211.0
7 2017 1 First Thursday Round Trip 20.0
8 2017 1 First Friday One Way 325.0
9 2017 1 First Friday Round Trip 36.0

Point plot:

In [123]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)

# Facet grid with point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = weekday_df, col = 'trip_type', height = 4, aspect = 1, hue = 'year')
g.map(sb.pointplot, "weekday", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on day of the week over years by trip type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Trip = {col_name}', weight = 'bold', size = 14, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    ax.set_xticklabels(labels, rotation = 30, size = 12)
            
    if i == 1:
        # Change the transparency of the lines in the second plot
        ax.lines[0].set_alpha(0.6)
        ax.lines[1].set_alpha(0.6)
        ax.lines[2].set_alpha(0.6)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        ax.lines[2].set_markerfacecolor('#93d1fa')
        ax.set_facecolor('0.97')
    elif i == 0:
        # Change the transparency of the lines in the first plot
        ax.lines[0].set_alpha(0.8)
        ax.lines[1].set_alpha(0.8)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        # Add an inverted triangle marker at the desired data points
        ax.lines[2].set_markevery(every=[5,6])
        ax.lines[2].set_marker('v')        
        ax.lines[2].set_markersize(10)
        ax.lines[2].set_markeredgewidth(2)
        ax.lines[2].set_markerfacecolor('orange')
        ax.lines[2].set_markeredgecolor('black')
        
# set the tick label size and the axis labels
g.set_yticklabels(size = 12)
g.set_xlabels('\nDay of the week', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# -------------------------------------------------------

## add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Year', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.4, 1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.4 Average bike rentals based on day of the week over years by trip type.png', dpi=300, bbox_inches='tight')

Observations:

  • The plot shows that the average number of One Way rentals drops sharply over the weekend (Saturday and Sunday). The drop is especially large in 2019 compared with the other years.
  • Round Trips, by contrast, see a slight increase in average rentals over the weekend.
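
The weekend pattern described above can also be checked numerically. The following is a minimal sketch, assuming a frame shaped like weekday_df (columns weekday, trip_type and rentals). The toy values and the helper name weekend_vs_weekday are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for weekday_df; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "year": [2019] * 8,
    "weekday": ["Monday", "Friday", "Saturday", "Sunday"] * 2,
    "trip_type": ["One Way"] * 4 + ["Round Trip"] * 4,
    "rentals": [300.0, 320.0, 200.0, 180.0, 30.0, 34.0, 40.0, 42.0],
})

def weekend_vs_weekday(df):
    """Mean rentals per trip type, split into weekday and weekend rows (hypothetical helper)."""
    is_weekend = df["weekday"].isin(["Saturday", "Sunday"])
    period = is_weekend.map({True: "weekend", False: "weekday"}).rename("period")
    # one row per trip type, one column per period
    return df.groupby(["trip_type", period])["rentals"].mean().unstack()

summary = weekend_vs_weekday(toy)
print(summary)
```

With the real weekday_df, the same helper could be applied per year to compare the size of the weekend drop between years.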

Reforms:

  • Steps should be taken to increase bike rentals at the end of the week. Organizing recreational events such as bike rallies could increase rentals during holidays and weekends.
  • Announcing discounts on One Way trips from stations with a high bike count to stations with a low bike count during weekdays could even out the distribution of bikes across stations.

3.1.5 Average bike rentals based on quarter of the year by trip type:

In [124]:
# create a dataset for bike rentals over each quarter in a year
quarter_df = bikeshare.groupby([bikeshare["year"], 
                                bikeshare["quarter"],
                                bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')
quarter_df.head(10)
Out[124]:
year quarter trip_type rentals
0 2017 Q1 One Way 30057
1 2017 Q1 Round Trip 3141
2 2017 Q2 One Way 46415
3 2017 Q2 Round Trip 4684
4 2017 Q3 One Way 61084
5 2017 Q3 Round Trip 10458
6 2017 Q4 One Way 58243
7 2017 Q4 Round Trip 11249
8 2018 Q1 One Way 53542
9 2018 Q1 Round Trip 10739

Point plot:

In [125]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)

# Seaborn's point plot
plot_order = quarter_df.quarter.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = quarter_df, col = 'trip_type', col_wrap = 2, height = 4, aspect = 1, hue = 'year')
g.map(sb.pointplot, "quarter", "rentals", order= plot_order, linestyles = "-", ci = None);

g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on quarter over the years by trip type', fontsize = 16, weight = 'bold')
g.set_titles('{col_name}', weight = 'bold', size = 14, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
y_tick_names = []
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
    if i == 1:
        # Change the transparency of the lines in the second plot
        ax.lines[0].set_alpha(0.6)
        ax.lines[1].set_alpha(0.6)
        ax.lines[2].set_alpha(0.6)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        ax.lines[2].set_markerfacecolor('#93d1fa')
        ax.set_facecolor('0.97')
    elif i == 0:
        # Change the transparency of the lines in the first plot
        ax.lines[0].set_alpha(0.8)
        ax.lines[1].set_alpha(0.8)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        # Add an inverted triangle marker at the desired data point
        ax.lines[2].set_markevery(every=[1])
        ax.lines[2].set_marker('v')        
        ax.lines[2].set_markersize(12)
        ax.lines[2].set_markeredgewidth(2)
        ax.lines[2].set_markerfacecolor('orange')
        ax.lines[2].set_markeredgecolor('black')
        
# sort the y_tick_names and assign them as new yticks
y_tick_names.sort()
g.set_yticklabels(y_tick_names, size = 12)
g.set_xticklabels(plot_order, size = 12)
g.set_xlabels('\nQuarter of the year', size = 14)
g.set_ylabels('Avg. Bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Year', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.4, 1));
# -------------------------------------------------------

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.5 Average bike rentals based on quarter over the years by trip type.png', dpi=300, bbox_inches='tight')

The plot shows that 2019 has a relatively low average number of One Way rentals in the second quarter of the year. Let us take a deeper look at this insight.
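
The quarterly figures behind the plot can also be read off a pivoted table. The sketch below uses only the ten rows displayed in Out[124], so it does not cover 2019. The column name one_way_share is an assumption introduced here for illustration.

```python
import pandas as pd

# The ten rows displayed in Out[124]
rows = pd.DataFrame({
    "year": [2017] * 8 + [2018] * 2,
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4", "Q1", "Q1"],
    "trip_type": ["One Way", "Round Trip"] * 5,
    "rentals": [30057, 3141, 46415, 4684, 61084, 10458, 58243, 11249, 53542, 10739],
})

# one row per (year, quarter), one column per trip type
wide = rows.pivot_table(index=["year", "quarter"], columns="trip_type",
                        values="rentals", aggfunc="sum")

# fraction of rentals in each quarter that are One Way trips
wide["one_way_share"] = wide["One Way"] / (wide["One Way"] + wide["Round Trip"])
print(wide.round(3))
```

Applying the same pivot to the full quarter_df would place the 2019 quarters next to the earlier years for a direct comparison.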

3.1.6 Average bike rentals based on month of the year by trip type:

In [126]:
# create a dataset for bike rentals over each month in a year
month_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')
month_df.head(10)
Out[126]:
year month trip_type rentals
0 2017 1 One Way 9195
1 2017 1 Round Trip 961
2 2017 2 One Way 8557
3 2017 2 Round Trip 811
4 2017 3 One Way 12305
5 2017 3 Round Trip 1369
6 2017 4 One Way 12311
7 2017 4 Round Trip 1324
8 2017 5 One Way 17320
9 2017 5 Round Trip 1704
In [127]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)

# Facet grid with point plot
plot_order = month_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = month_df, col = 'year', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on the month over years by trip type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nMonth of the year', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
y_tick_names = []
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
    if i != 2:
        # Change the transparency of first line
        ax.lines[0].set_alpha(0.2)
        ax.lines[0].set_markerfacecolor('#ff9ec6')
        ax.set_facecolor('0.97')
    else:
        # Add an inverted triangle marker at the desired data points
        ax.lines[0].set_markevery(every=slice(0,5,1))
        ax.lines[0].set_marker('v')        
        ax.lines[0].set_markersize(8)
        ax.lines[0].set_markeredgewidth(1)
        ax.lines[0].set_markerfacecolor('black')
        ax.lines[0].set_markeredgecolor('orange')
    if i != 0:
        ax.axvline(0, ls='--', color='grey', linewidth=1, alpha=0.5)
        sb.despine(left = True, ax=ax)
    # Change the transparency of second line
    ax.lines[1].set_alpha(0.2)
    ax.lines[1].set_markerfacecolor('#97f7e9')   
    # set xlabels fontsize
    labels = ax.get_xticklabels()
    ax.set_xticklabels(labels, size = 12)

# sort the y_tick_names
y_tick_names.sort()
g.set_yticklabels(y_tick_names, size = 12)
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]

plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.1, 1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.0, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.6 Average bike rentals based on the month over the years by trip type.png', dpi=300, bbox_inches='tight')
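
The 'K' tick labels in the cell above are built by reading the tick text and sorting the resulting strings. String sorting is lexicographic (for example, '10 K' sorts before '5 K'), so an alternative worth considering is to let matplotlib format each tick from its numeric value. The following is a minimal sketch using matplotlib.ticker.FuncFormatter. The helper name thousands is hypothetical, the three counts are One Way values shown in Out[126], and the Agg backend line is only there so that the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def thousands(value, position):
    """Format a tick value such as 12000 as '12 K' (hypothetical helper)."""
    return '{:0.0f} K'.format(value / 1000)

fig, ax = plt.subplots()
# One Way rentals for months 1, 3 and 5 of 2017, as shown in Out[126]
ax.plot([1, 3, 5], [9195, 12305, 17320])
ax.yaxis.set_major_formatter(FuncFormatter(thousands))

# draw once so that the tick labels are populated
fig.canvas.draw()
tick_labels = [t.get_text() for t in ax.get_yticklabels()]
print(tick_labels)
plt.close(fig)
```

In a FacetGrid, the same formatter could be set on each axis in g.axes.flat, which avoids collecting and sorting label strings.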

Observation:

  • The first half of 2019 has relatively few One Way rentals compared with the second half of the year. This pattern is not limited to 2019; it is consistent across the other years.

Reform:

  • Promotions or discounts should be offered on One Way trips during the first half of the year to encourage customers to take more One Way trips.

3.1.7 Average bike rentals over each hour in a day by trip type and bike type:

In [128]:
# create a dataset for bike rentals over each hour in a day by trip type and bike type
hours_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["day"],
                              bikeshare["hour"],
                              bikeshare["trip_type"],
                              bikeshare["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Out[128]:
year month day hour trip_type bike_type rentals
0 2017 1 1 0 One Way unknown 6.0
1 2017 1 1 0 One Way Standard NaN
2 2017 1 1 0 One Way Electric NaN
3 2017 1 1 0 One Way Smart NaN
4 2017 1 1 0 Round Trip unknown 3.0
5 2017 1 1 0 Round Trip Standard NaN
6 2017 1 1 0 Round Trip Electric NaN
7 2017 1 1 0 Round Trip Smart NaN
8 2017 1 1 1 One Way unknown 5.0
9 2017 1 1 1 One Way Standard NaN
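
The NaN rows in the output above are consistent with grouping on categorical columns, where pandas can emit a row for every category combination even when no trips fall into it. This is an assumption about the dtype of bike_type, which is not shown in this section. The toy frame below is a minimal sketch of the effect and of the observed parameter of groupby.

```python
import pandas as pd

# Toy frame; bike_type is categorical, as assumed for the bikeshare table.
toy = pd.DataFrame({
    "hour": [0, 0, 1],
    "bike_type": pd.Categorical(
        ["unknown", "unknown", "Standard"],
        categories=["unknown", "Standard", "Electric", "Smart"],
    ),
    "trip_id": [101, 102, 103],
})

# observed=False keeps category combinations that never occur in the data
all_levels = toy.groupby(["hour", "bike_type"], observed=False)["trip_id"].count()

# observed=True keeps only the combinations that occur in the data
seen_only = toy.groupby(["hour", "bike_type"], observed=True)["trip_id"].count()

print(len(all_levels), len(seen_only))
```

Whether the empty combinations should be kept depends on the plot: dropping them, or filling them with 0, changes the averages that a point plot computes.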
In [129]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'bike_type', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type and bike type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Bike = {row_name} | Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    for loc,label in enumerate(labels):
        # skip labels
        if not (loc%5 == 0): labels[loc] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
    if i%3 != 0:
        sb.despine(left = True, ax=ax)
        ax.axvline(0, ls='--', color='grey', linewidth=1, alpha=0.5);
    if i in [0, 1, 2, 3, 6, 9, 10, 11]:
        # Change the transparency of first line
        ax.lines[0].set_alpha(0.4)
        ax.lines[0].set_markerfacecolor('#ff9ec6')
        ax.set_facecolor('0.97')
    if i in [4, 5, 7, 8]: 
        ax.lines[0].set_marker('o')
        ax.lines[0].set_markersize(4)
        ax.lines[0].set_markerfacecolor('#e36297')
    # Change the transparency of second line
    ax.lines[1].set_alpha(0.4)
    ax.lines[1].set_markerfacecolor('#97f7e9') 
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]

plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 3, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0, 5.5));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.7 Average bike rentals based on hour of the day over years by trip type and bike type.png', dpi=300, bbox_inches='tight')

The plot shows that Standard bike rentals for One Way trips fell significantly in 2019 compared with 2018. At the same time, there is a noticeable increase in Electric bike rentals for One Way trips. This suggests that customers preferred Electric bikes over Standard bikes for One Way trips in 2019. Let us take a closer look at the reason behind this trend.

3.1.8 Average bike rentals of standard and electric bikes over 2018 and 2019 by trip type:

In [130]:
# create a dataset for bike rentals of standard and electric bikes over 2018 and 2019 by trip type
hours_df = hours_df.query(' (year == 2018 or year == 2019) and (bike_type == "Standard" or bike_type == "Electric")').copy()
level_order = ['Standard', 'Electric']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
hours_df['bike_type'] = hours_df['bike_type'].astype(ordered_cat)
hours_df
Out[130]:
year month day hour trip_type bike_type rentals
71425 2018 1 1 0 One Way Standard NaN
71426 2018 1 1 0 One Way Electric NaN
71429 2018 1 1 0 Round Trip Standard NaN
71430 2018 1 1 0 Round Trip Electric NaN
71433 2018 1 1 1 One Way Standard NaN
... ... ... ... ... ... ... ...
214262 2019 12 31 22 Round Trip Electric 1.0
214265 2019 12 31 23 One Way Standard 5.0
214266 2019 12 31 23 One Way Electric 7.0
214269 2019 12 31 23 Round Trip Standard 2.0
214270 2019 12 31 23 Round Trip Electric 1.0

71424 rows × 7 columns

In [131]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)

# Facet grid with point plot
plot_order = hours_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'bike_type', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals of standard and electric bikes over 2018 and 2019 by trip type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Bike = {row_name} | Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nMonth of the year', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # set xlabels fontsize
    labels = ax.get_xticklabels()
    ax.set_xticklabels(labels, size = 12)
    if i%2 != 0:
        sb.despine(left=True, ax = ax)
        ax.axvline(0, ls='--', color='grey', linewidth=1, alpha=0.5);
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]

plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.6, 2.3));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.8 Average bike rentals of standard and electric bikes over 2018 and 2019 by trip type.png', dpi=300, bbox_inches='tight')

Because the Electric bike was introduced at the end of 2018, customers who had used Standard bikes shifted towards Electric bikes after the second quarter of 2019. This indicates that customer preference for bike type changed, while the total number of One Way rentals did not decrease (see plot 3.1.3).
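
The shift from Standard to Electric bikes can be expressed as a share per quarter. The following is a minimal sketch using pd.crosstab with row normalisation. The toy trips are invented to illustrate the calculation and do not reproduce the dataset.

```python
import pandas as pd

# Toy One Way trips; values are invented to illustrate the calculation only.
trips = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q3", "Q3", "Q3", "Q3"],
    "bike_type": ["Standard", "Standard", "Standard", "Electric",
                  "Standard", "Electric", "Electric", "Electric"],
})

# Row-normalised crosstab: each row sums to 1, giving the bike type share per quarter
share = pd.crosstab(trips["quarter"], trips["bike_type"], normalize="index")
print(share)
```

On the real data, filtering bikeshare to 2019 One Way trips before building the crosstab would show in which quarter the Electric share overtakes the Standard share, if it does.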

3.1.9 Distribution of bike rentals over trip type by the fare type:

In [132]:
# create a dataset for bike rentals over each hour in a day by trip type and fare type
hours_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["day"],
                              bikeshare["hour"],
                              bikeshare["trip_type"],
                              bikeshare["fare_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Out[132]:
year month day hour trip_type fare_type rentals
0 2017 1 1 0 One Way Base 6.0
1 2017 1 1 0 One Way Extended NaN
2 2017 1 1 0 Round Trip Base 1.0
3 2017 1 1 0 Round Trip Extended 2.0
4 2017 1 1 1 One Way Base 5.0
5 2017 1 1 1 One Way Extended NaN
6 2017 1 1 1 Round Trip Base NaN
7 2017 1 1 1 Round Trip Extended NaN
8 2017 1 1 2 One Way Base 7.0
9 2017 1 1 2 One Way Extended 1.0

Point plot:

In [133]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'fare_type', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type and fare type', 
               fontsize = 15, weight = 'bold')
g.set_titles('Fare = {row_name} | Year = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    for loc, label in enumerate(labels):
        # skip labels (loc is used so that the outer axis index i is not overwritten)
        if not (loc%5 == 0): labels[loc] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
    if i in [0, 1, 2]:
        # Change the transparency of both lines
        ax.lines[0].set_alpha(0.4)
        ax.lines[1].set_alpha(0.4)
        ax.set_facecolor('0.97')
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]

plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.9 Average bike rentals based on hour of the day over years by trip type and fare type.png', dpi=300, bbox_inches='tight')

Observation:

  • The plot shows that customers who pay Extended fares take almost as many Round Trips as One Way trips.
  • Customers who pay the Base fare, in contrast, prefer One Way trips.

Reform:

  • This behaviour has a positive end result and does not require any intervention. If more customers took longer One Way rides (fares and ride durations are correlated), bikes would end up farther from their home stations, which might create a gap between the supply of and demand for bikes at those stations. Because most One Way trips are shorter rides, the bikes stay within the same geographical cluster, and customers can easily be redirected to nearby available stations when a station runs short of bikes.

----------------------------------------------

3.1.10 Insights:

  1. The distribution of bike rentals aggregated over all years suggests that customers prefer One Way trips to Round Trips. The exception is a grey area in the early hours of the day, where the average number of rentals is very low and not statistically significant for comparison.
  2. The average number of One Way rentals decreases on Saturdays and Sundays, while Round Trips experience a slight increase.
  3. The first half of 2019 has relatively few One Way rentals compared with the second half of the year. This pattern is not limited to 2019; it is consistent across the other years.
  4. Standard bike rentals for One Way trips fell significantly in 2019 compared with 2018. At the same time, there is a noticeable increase in Electric bike rentals for One Way trips, which suggests that customers preferred Electric bikes over Standard bikes for One Way trips in 2019. Because the Electric bike was introduced at the end of 2018, customers who had used Standard bikes shifted towards Electric bikes after the second quarter of 2019. Customer preference for bike type changed, but the total number of One Way rentals did not decrease.
  5. Customers who pay the Base fare prefer One Way trips, while customers who pay Extended fares take almost as many Round Trips as One Way trips and do not show a preference for either trip type.

3.1.11 Reforms proposed:

  1. Steps should be taken to increase bike rentals at the end of the week. Organizing events such as bike rallies could increase rentals during holidays and weekends.
  2. Announcing discounts on One Way trips from stations with a high bike count to stations with a low bike count during weekdays could even out the distribution of bikes across stations.
  3. Promotions or discounts should be offered on One Way trips during the first half of the year to encourage customers to take more One Way trips.
  4. Having fewer customers who pay Extended fares on One Way trips has a positive end result and does not require any intervention. If more customers took longer One Way rides (fares and ride durations are correlated), bikes would end up farther from their home stations, which might create a gap between the supply of and demand for bikes at those stations. Because most One Way trips are shorter rides (Base fare), the bikes stay within the same geographical cluster, which makes it easier to redirect customers to nearby available stations when a station runs short of bikes.

3.2 Are standard bikes in more demand than smart and electric bikes? Is the launch of smart and electric bikes in 2019 considered a success?

  • Column: bike_type
  • Data type: categorical data, nominal
  • Plot : Bar chart, Point plot, Facet grid

3.2.1 Aggregated distribution of bike rentals based on bike type:

In [134]:
# Assign color palette as per requirement
sb.set_style('white')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[2]

# prepare data for the plot
bike_type_order = bikeshare.bike_type.value_counts().index
max_count = bikeshare['bike_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]

# Seaborn's countplot
sb.countplot(data = bikeshare, x = 'bike_type', color = base_color, alpha= 0.5, order = bike_type_order)

# improve plot aesthetics
plt.title('Aggregated rentals based on bike type\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
bike_type_counts = bikeshare['bike_type'].value_counts()

# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = bike_type_counts[label.get_text()]

    except KeyError:
        count = 0

    pct_string = '{:0.0f}%'.format(100*count/n_points)

    # print the annotation depending on the bar length
    if count < (n_points/20):
        plt.text(loc, count + (n_points/40), pct_string, ha = 'center', color = 'black', fontsize = 13)
    else:
        plt.text(loc, count - (n_points/20), pct_string, ha = 'center', color = 'black', fontsize = 13);
# -------------------------------------------------------
    
sb.despine();
    
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.1 Aggregated distribution of bike rentals based on bike type.png', dpi=300, bbox_inches='tight')

The plot shows that Standard bikes are in more demand than Electric and Smart bikes. However, more than 50% of the bike rentals have no bike type label, which makes this comparison unreliable. In addition, the calculation is based on rentals aggregated over 3 years, so a deeper analysis segmented by year is needed to reveal any hidden insights.
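
Because unlabelled rentals dominate the aggregate, it can help to report the shares both with and without them. The following is a minimal sketch on a toy series. The mix of labels is invented for illustration, and the label 'unknown' follows the value shown in Out[128].

```python
import pandas as pd

# Toy bike_type column; the mix of labels is invented for illustration.
bike_type = pd.Series(["unknown"] * 6 + ["Standard"] * 3 + ["Electric"] * 1,
                      name="bike_type")

# Share of every label, including the unlabelled rentals
share_all = bike_type.value_counts(normalize=True)

# Share among labelled rentals only
labelled = bike_type[bike_type != "unknown"]
share_labelled = labelled.value_counts(normalize=True)

print(share_all)
print(share_labelled)
```

Reporting both views makes it clear how much of any apparent preference rests on the labelled minority of rentals.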

3.2.2 Aggregated bike rentals based on bike type over years:

In [135]:
# Assign color palette as per requirement
sb.set_style('white')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[2]

# prepare data for the plot
bike_type_order = bikeshare.bike_type.value_counts().index
max_count = bikeshare['bike_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]

# Seaborn's countplot
g = sb.FacetGrid(data = bikeshare, col = 'year', height = 4.5, aspect = 0.8)
g.map(sb.countplot, 'bike_type', color = base_color, alpha= 0.5, order = bike_type_order);

# improve plot aesthetics
# -------------------------------------------------------
y_tick_names = []
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
            
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Aggregated bike rentals based on bike type over the years', fontsize = 16, weight = 'bold')
g.set_titles('Year = {col_name}', weight = 'bold', size = 14, color = 'dimgrey')
g.set_xlabels('\nBike Type', size = 12)
g.set_ylabels('Bike rentals\n', size = 12) 
g.set_xticklabels(size = 12)
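# sort the y_tick_names numerically (a plain string sort would place '100 K' before '20 K') and assign them as new yticks
y_tick_names.sort(key = lambda name: float(name.split()[0]))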
g.set_yticklabels(y_tick_names, size = 12)
# -------------------------------------------------------


# add annotations
# -------------------------------------------------------
total_count = bikeshare.shape[0]
for ax in g.axes.ravel(): # loops over the different figures in the grid 
    for i, p in enumerate(ax.patches): # loops over the different bars in each figure 
        ax.annotate('{:0.1f}%'.format(100*p.get_height()/total_count), (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
        # fade out the bars related to unknown  bike type
        if i == 0:
            p.set_alpha(0.2);
# ------------------------------------------------------

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.2 Aggregated bike rentals based on bike type over years.png', dpi=300, bbox_inches='tight')

The above plot shows that the classification of bikes was introduced at some point in 2018, which explains why bikes with the unknown label appear in the plots for 2017 and 2018. However, the point in 2018 at which the classification was introduced is not clear, and further analysis is required to decide whether to include or exclude the 2018 rentals.

3.2.3 Average bike rentals based on month over years by trip type and bike type:

In [136]:
# create a dataset for bike rentals over each month for all years
months_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["trip_type"],
                              bikeshare["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
months_df.head(10)
Out[136]:
year month trip_type bike_type rentals
0 2017 1 One Way unknown 9195.0
1 2017 1 One Way Standard NaN
2 2017 1 One Way Electric NaN
3 2017 1 One Way Smart NaN
4 2017 1 Round Trip unknown 961.0
5 2017 1 Round Trip Standard NaN
6 2017 1 Round Trip Electric NaN
7 2017 1 Round Trip Smart NaN
8 2017 2 One Way unknown 8557.0
9 2017 2 One Way Standard NaN

Facet grid:

In [137]:
def plot_rectangle(ax, x, y, width, height):
    '''plots rectangular patch in the specified axis'''
    ax.add_patch(
        patches.Rectangle(
            (x,y),
            width,
            height,
            # You can add rotation with 'angle'
            alpha=0.25, facecolor="gold", edgecolor="gold", linewidth=1, linestyle='solid'
        )
    )


# Assign palette as per requirement
sb.set_style('white')
flatui = ['#ff7ddd', '#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=4, desat=0.6)

# Facet grid with point plot
plot_order = months_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = months_df, col = 'year', row = 'trip_type', height = 4, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on month over years by trip type and bike type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Trip Type = {row_name} | Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
y_tick_names = []
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
    # get x labels
    xlabels = ax.get_xticklabels()
    ax.set_xticklabels(xlabels, size = 12)
    if i not in [0, 3]:
        sb.despine(left=True, ax = ax)
            
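# sort the y_tick_names numerically (a plain string sort would place '100 K' before '20 K') and assign them as new yticks
y_tick_names.sort(key = lambda name: float(name.split()[0]))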
g.set_yticklabels(y_tick_names, size = 12)
g.set_xlabels('\nMonth of the year', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[3], linestyle='-', linewidth = 2)]
labels = months_df.bike_type.sort_values(ascending=True).unique()
plt.legend(custom, labels, scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 0.9));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);


# Add rectangles to highlight the area of interest
# -------------------------------------------------------
ax1 = g.facet_axis(0,1)
ax2 = g.facet_axis(1,1)

plot_rectangle(ax = ax1, x=8, y=0, width = 3, height = 30000)
plot_rectangle(ax = ax2, x=8, y=0, width = 3, height = 10000);
# -------------------------------------------------------


# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.3 Average bike rentals based on month over years by trip type and bike type.png', dpi=300, bbox_inches='tight')

The highlighted yellow regions in the above plots show that the classification of bikes was introduced towards the end of 2018. Hence the rentals with the unknown bike type in 2018 can be ignored, and the analysis in the following plots is limited mostly to 2019 for clearer insights.
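The month in which labeled bike types first appear can also be located without reading it off the plot. This is a minimal sketch on a made-up `toy` frame; it assumes the same `year`, `month` and `bike_type` columns used in this section and the `unknown` placeholder for unlabeled rentals.

```python
import pandas as pd

# Toy rentals; the values are invented for illustration.
toy = pd.DataFrame({
    "year":  [2018, 2018, 2018, 2018, 2019],
    "month": [8, 9, 10, 11, 1],
    "bike_type": ["unknown", "unknown", "Standard", "Electric", "Standard"],
})

# Keep only rentals with a real label and order them chronologically.
labeled = toy[toy["bike_type"] != "unknown"].sort_values(["year", "month"])

# First (year, month) in which each bike type shows up.
first_seen = labeled.groupby("bike_type", sort=False)[["year", "month"]].first()

# Earliest labeled rental overall.
first_labeled = labeled.iloc[0]

print(first_seen)
print(int(first_labeled["year"]), int(first_labeled["month"]))
```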

3.2.4 Average monthly bike rentals categorized by bike type in 2019:

In [138]:
month_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["bike_type"]]).count()['trip_id'].reset_index(name='rentals')

# month_df['rentals'] = month_df['rentals'].fillna(0).astype(int)
month_df = month_df.query(' year == 2019 ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
month_df['bike_type'] = month_df['bike_type'].astype(ordered_cat)
month_df.head(10)
Out[138]:
year month bike_type rentals
96 2019 1 NaN NaN
97 2019 1 Standard 18021.0
98 2019 1 Electric 1142.0
99 2019 1 Smart NaN
100 2019 2 NaN NaN
101 2019 2 Standard 15609.0
102 2019 2 Electric 991.0
103 2019 2 Smart 18.0
104 2019 3 NaN NaN
105 2019 3 Standard 17867.0

Point plot:

In [139]:
# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
base_color = sb.color_palette()[0]

# Seaborn's pointplot
ax = sb.pointplot(data = month_df, x = "month", y = "rentals", linestyles = "-", hue = 'bike_type', 
                  scale = 1, ci = None)

# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average monthly bike rentals categorized by bike type in 2019\n\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals (thousands)\n', fontsize = 14)
plt.xlabel('\nMonth of the year', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
monthly_rental_avg_max = locs.max()
y_tick_values = np.arange(0, monthly_rental_avg_max+5000, 5000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, 
           framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1))

# plot custom axial grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.5)

    
# add ellipse
# -------------------------------------------------------
ax.add_patch(
    patches.Ellipse(
        (8, 12000), # (x,y)
        7, # width
        12000, # height
        angle=0, # rotation angle in degrees (not a radius)
        alpha=0.2, facecolor="gold", edgecolor="gold", linewidth=1, linestyle='solid'
    )
);
# -------------------------------------------------------

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.4 Average monthly bike rentals categorized by bike type in 2019.png', dpi=300, bbox_inches='tight')

Observation:

The above plot shows that rentals of Standard bikes decrease over 2019, while rentals of Smart and Electric bikes increase over the same period. Hence, even though Standard bikes were popular at the start of 2019, customers preferred Smart and Electric bikes towards the end of the year. This suggests that the launch of Electric and Smart bikes was a success.
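A shift in preference is easier to judge from shares than from raw counts, because total demand may change over the year as well. The sketch below uses a made-up `month_toy` frame with the same columns as `month_df` and converts monthly counts into per-month shares.

```python
import pandas as pd

# Toy monthly counts; the numbers are invented for illustration.
month_toy = pd.DataFrame({
    "month": [1, 1, 12, 12],
    "bike_type": ["Standard", "Electric", "Standard", "Electric"],
    "rentals": [900, 100, 400, 600],
})

# One row per month, one column per bike type.
wide = month_toy.pivot(index="month", columns="bike_type", values="rentals")

# Total rentals per month, and each bike type's share of that total.
total = wide.sum(axis=1)
share = wide.div(total, axis=0)

print(total)
print(share)
```

If the monthly totals stay flat while the Electric share rises, the new bikes are replacing Standard rentals rather than adding new ones, which is worth checking before calling the launch a success.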

Let us take a look at the other factors that influence the bike type over time:

3.2.5 Average hourly bike rentals categorized by bike type in 2019:

In [140]:
# create a dataset for bike rentals over each hour in a day in the year 2019
temp_df = bikeshare.query(' year == 2019 ').copy()

level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)

hours_df = temp_df.groupby([temp_df["month"],
                            temp_df["day"],
                            temp_df["hour"],
                            temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')

hours_df.head(10)
Out[140]:
month day hour bike_type rentals
0 1 1 0 Standard 19.0
1 1 1 0 Electric NaN
2 1 1 0 Smart NaN
3 1 1 1 Standard 8.0
4 1 1 1 Electric NaN
5 1 1 1 Smart NaN
6 1 1 2 Standard 16.0
7 1 1 2 Electric NaN
8 1 1 2 Smart NaN
9 1 1 3 Standard 2.0

Point plot:

In [141]:
# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
base_color = sb.color_palette()[0]

# Seaborn's pointplot
ax = sb.pointplot(data = hours_df, x = "hour", y = "rentals", linestyles = "-", hue = 'bike_type', 
                  scale = 1, ci = None)

# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average hourly bike rentals categorized by bike type in 2019\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nHour of the day (year 2019)', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, 
           framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,  
           title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.25, 1));

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.5 Average hourly bike rentals categorized by bike type in 2019.png', dpi=300, bbox_inches='tight')
  • The above plot shows that customers prefer the Standard bike between 7:00 AM and 9:00 AM and between 3:00 PM and 5:00 PM, which are the typical hours for commuting to and from work. This suggests that working individuals preferred Standard bikes for riding to their workplaces and back home after work.
  • While Electric bikes are as popular as Standard bikes across most hours, they are particularly preferred between 7:00 PM and 12:00 AM.
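One caveat when reading these averages: the `hours_df` preview shows `NaN` for hours in which a bike type had no rentals. pandas' `mean` skips `NaN` by default, so if the plotting step drops missing values in the same way, the average is taken only over hours with at least one rental. The sketch below, on a made-up `hours_toy` frame, shows how much the two conventions can differ.

```python
import numpy as np
import pandas as pd

# Toy hourly counts for one hour across three days; values are invented.
hours_toy = pd.DataFrame({
    "day": [1, 2, 3, 1, 2, 3],
    "hour": [8, 8, 8, 8, 8, 8],
    "bike_type": ["Standard"] * 3 + ["Electric"] * 3,
    "rentals": [10.0, 20.0, 30.0, 6.0, np.nan, np.nan],
})

# Average over hours that had at least one rental (NaN rows are skipped).
mean_skipna = hours_toy.groupby(["hour", "bike_type"])["rentals"].mean()

# Average over all hours, counting hours without rentals as zero.
mean_zero = (hours_toy.fillna({"rentals": 0})
             .groupby(["hour", "bike_type"])["rentals"].mean())

print(mean_skipna)
print(mean_zero)
```

Which convention is appropriate depends on the question: "how busy is a typical hour" favors filling with zero, while "how many bikes go out when any go out" favors skipping the gaps.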

Is the trend influenced by other variables? Let us analyze further to uncover any hidden trends.

3.2.6 Average hourly bike rentals categorized by bike type and trip type in 2019:

In [142]:
# create a dataset for bike rentals over each hour in the day by trip type and bike type in the year 2019
temp_df = bikeshare.query(' year == 2019 ').copy()

level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)

hours_df = temp_df.groupby([temp_df["month"],
                            temp_df["day"],
                            temp_df["hour"],
                            temp_df["trip_type"],
                            temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')

hours_df.head(10)
Out[142]:
month day hour trip_type bike_type rentals
0 1 1 0 One Way Standard 18.0
1 1 1 0 One Way Electric NaN
2 1 1 0 One Way Smart NaN
3 1 1 0 Round Trip Standard 1.0
4 1 1 0 Round Trip Electric NaN
5 1 1 0 Round Trip Smart NaN
6 1 1 1 One Way Standard 6.0
7 1 1 1 One Way Electric NaN
8 1 1 1 One Way Smart NaN
9 1 1 1 Round Trip Standard 2.0

Facet grid:

In [143]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'trip_type', col_wrap = 3, height = 3.5, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average hourly bike rentals by bike type and trip type in 2019', fontsize = 15, weight = 'bold')
g.set_titles('Trip = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        # skip labels
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------


# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
# ax1.lines[1].set_alpha(0.4)
# ax1.lines[2].set_alpha(0.4)
ax2 = g.facet_axis(0,1)
# ax2.lines[0].set_alpha(0.4)
# ax2.lines[2].set_alpha(0.4)
# -------------------------------------------------------


# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.5, 1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.6 Average hourly bike rentals by bike type and trip type in 2019.png', dpi=300, bbox_inches='tight')

It appears that the trend is consistent for One Way trips, while customers who take Round Trips do not show any preference among bike types. However, the plot is based on all rentals aggregated over 2019. Is the same trend consistent throughout 2019? Let us calculate the average bike rentals over the quarters of 2019 for deeper insights.
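The quarterly breakdown relies on a `quarter` column with labels such as `Q1`. How that column was built is not shown in this section; one way to derive such labels from a month number, given here only as an assumption, is integer division:

```python
import pandas as pd

# Example month numbers (1 = January, 12 = December).
months = pd.Series([1, 3, 4, 6, 7, 9, 10, 12], name="month")

# Months 1-3 map to Q1, 4-6 to Q2, 7-9 to Q3 and 10-12 to Q4.
quarter = "Q" + ((months - 1) // 3 + 1).astype(str)

print(quarter.tolist())
```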

3.2.7 Average hourly bike rentals categorized by bike type and trip type over the quarters of 2019:

In [144]:
# create a dataset for bike rentals over each hour in the day by trip type and bike type over quarters in 2019
temp_df = bikeshare.query(' year == 2019 ').copy()

level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)

hours_df = temp_df.groupby([temp_df["quarter"],
                            temp_df["month"],
                            temp_df["day"],
                            temp_df["hour"],
                            temp_df["trip_type"],
                            temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')

hours_df.head(10)
Out[144]:
quarter month day hour trip_type bike_type rentals
0 Q1 1 1 0 One Way Standard 18.0
1 Q1 1 1 0 One Way Electric NaN
2 Q1 1 1 0 One Way Smart NaN
3 Q1 1 1 0 Round Trip Standard 1.0
4 Q1 1 1 0 Round Trip Electric NaN
5 Q1 1 1 0 Round Trip Smart NaN
6 Q1 1 1 1 One Way Standard 6.0
7 Q1 1 1 1 One Way Electric NaN
8 Q1 1 1 1 One Way Smart NaN
9 Q1 1 1 1 Round Trip Standard 2.0

Facet Grid:

In [145]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'quarter', row = 'trip_type', height = 3, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day by bike type and trip type over quarters of 2019', 
               fontsize = 15, weight = 'bold')
g.set_titles('Trip = {row_name} | Quarter = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        # skip labels
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------


# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
ax1.lines[1].set_alpha(0.4)
ax1.lines[2].set_alpha(0.4)
ax2 = g.facet_axis(0,1)
ax2.lines[2].set_alpha(0.4)
ax3 = g.facet_axis(0,2)
ax3.lines[0].set_alpha(0.4)
ax3.lines[2].set_alpha(0.4)
ax4 = g.facet_axis(0,3)
ax4.lines[0].set_alpha(0.4)
ax4.lines[2].set_alpha(0.4)

ax5 = g.facet_axis(1,0)
ax6 = g.facet_axis(1,1)
ax7 = g.facet_axis(1,2)
ax8 = g.facet_axis(1,3)
for ax in [ax5, ax6, ax7, ax8]:
    ax.lines[0].set_alpha(0.4)
    ax.lines[1].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)
# -------------------------------------------------------


# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.7 Average hourly bike rentals by bike type and trip type in quarter of 2019.png', dpi=300, bbox_inches='tight')

Observations:

  • Even though Standard bikes are the most popular choice during the first quarter of the year, Electric bikes gradually gained popularity for One Way trips over the rest of the year.
  • Customers who take Round Trips do not show any preference among bike types.

3.2.8 Average hourly bike rentals categorized by bike type and pass type in 2019:

In [146]:
# create a dataset for bike rentals over each hour in the day by pass type and bike type in the year 2019
temp_df = bikeshare.query(' year == 2019 ').copy()

level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
level_order = ['One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['pass_type'] = temp_df['pass_type'].astype(ordered_cat)

hours_df = temp_df.groupby([temp_df["month"],
                            temp_df["day"],
                            temp_df["hour"],
                            temp_df["pass_type"],
                            temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')

hours_df.head(10)
Out[146]:
month day hour pass_type bike_type rentals
0 1 1 0 One Day Standard 18.0
1 1 1 0 One Day Electric NaN
2 1 1 0 One Day Smart NaN
3 1 1 0 Monthly Standard 1.0
4 1 1 0 Monthly Electric NaN
5 1 1 0 Monthly Smart NaN
6 1 1 0 Annual Standard NaN
7 1 1 0 Annual Electric NaN
8 1 1 0 Annual Smart NaN
9 1 1 1 One Day Standard 7.0

Facet Grid:

In [147]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'pass_type', col_wrap = 3, height = 3.5, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average hourly bike rentals by bike type and pass type in 2019', fontsize = 15, weight = 'bold')
g.set_titles('Pass = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')


# improve plot aesthetics
# -------------------------------------------------------
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        # skip labels
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
    
g.set_yticklabels(size = 12)
g.set_xlabels('\nHour of the day', size = 13)
g.set_ylabels('Avg. bike rentals\n', size = 13)
# -------------------------------------------------------


# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
ax1.lines[1].set_alpha(0.4)
ax1.lines[2].set_alpha(0.4)
ax2 = g.facet_axis(0,1)
ax2.lines[0].set_alpha(0.4)
ax2.lines[2].set_alpha(0.4)
ax3 = g.facet_axis(0,2)
ax3.lines[0].set_alpha(0.4)
ax3.lines[1].set_alpha(0.4)
ax3.lines[2].set_alpha(0.4)
# -------------------------------------------------------


# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.5, 1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.8 Average hourly bike rentals by bike type and pass type in 2019.png', dpi=300, bbox_inches='tight')

The above plot shows that customers with a One Day pass prefer Standard bikes, while customers with a Monthly pass prefer Electric bikes. As the number of bike rentals made with an Annual pass is very low, the bike preference of customers with an Annual pass is not evaluated.
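The decision to leave a small group out of the comparison can be made explicit with a size check. This is a minimal sketch on made-up data; the 10% cutoff `min_share` is a hypothetical threshold chosen for illustration, not a value taken from the analysis.

```python
import pandas as pd

# Toy pass types; the counts are invented for illustration.
toy = pd.DataFrame({
    "pass_type": ["One Day"] * 50 + ["Monthly"] * 45 + ["Annual"] * 5,
})

# Number of rentals per pass type.
counts = toy["pass_type"].value_counts()

# Hypothetical cutoff: groups below this share of all rentals are not interpreted.
min_share = 0.10
too_small = counts[counts / counts.sum() < min_share].index.tolist()

print(counts)
print(too_small)
```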

However, the plot is based on all rentals aggregated over 2019. Is the same trend consistent throughout 2019? Let us calculate the average bike rentals over the quarters of 2019 for deeper insights.

3.2.9 Average hourly bike rentals categorized by bike type and pass type over the quarters of 2019:

In [148]:
# create a dataset for bike rentals over each hour in the day by pass type and bike type over quarters in 2019
temp_df = bikeshare.query(' year == 2019 ').copy()

level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
level_order = ['One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['pass_type'] = temp_df['pass_type'].astype(ordered_cat)

hours_df = temp_df.groupby([temp_df["quarter"],
                            temp_df["month"],
                            temp_df["day"],
                            temp_df["hour"],
                            temp_df["pass_type"],
                            temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')

hours_df.head(10)
Out[148]:
quarter month day hour pass_type bike_type rentals
0 Q1 1 1 0 One Day Standard 18.0
1 Q1 1 1 0 One Day Electric NaN
2 Q1 1 1 0 One Day Smart NaN
3 Q1 1 1 0 Monthly Standard 1.0
4 Q1 1 1 0 Monthly Electric NaN
5 Q1 1 1 0 Monthly Smart NaN
6 Q1 1 1 0 Annual Standard NaN
7 Q1 1 1 0 Annual Electric NaN
8 Q1 1 1 0 Annual Smart NaN
9 Q1 1 1 1 One Day Standard 7.0

Facet grid:

In [149]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'quarter', row = 'pass_type', height = 3, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average bike rentals based on hour of the day by bike type and pass type over quarters of 2019', 
               fontsize = 15, weight = 'bold')
g.set_titles('Pass = {row_name} | Quarter = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day (year 2019)', size = 13)
g.set_ylabels('Avg. bike rentals\n', size = 13)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        # skip labels
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------


# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
ax2 = g.facet_axis(0,1)
ax3 = g.facet_axis(0,2)
ax4 = g.facet_axis(0,3)
for ax in [ax1, ax2, ax3, ax4]:
    ax.lines[1].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)

ax5 = g.facet_axis(1,0)
ax6 = g.facet_axis(1,1)
ax7 = g.facet_axis(1,2)
ax8 = g.facet_axis(1,3)
for ax in [ax5, ax6, ax7, ax8]:
    ax.lines[0].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)
    
ax9 = g.facet_axis(2,0)
ax10 = g.facet_axis(2,1)
ax11 = g.facet_axis(2,2)
ax12 = g.facet_axis(2,3)
for ax in [ax9, ax10, ax11, ax12]:
    ax.lines[0].set_alpha(0.4)
    ax.lines[1].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)
# -------------------------------------------------------


# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 3, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(-0.4, 4.3));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.9 Average hourly bike rentals by bike type and pass type in quarter of 2019.png', dpi=300, bbox_inches='tight')
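
The three per-row loops in the cell above repeat one step: fade every line in a facet except the one being highlighted. A minimal sketch of that idea as a reusable helper follows. `fade_other_lines` and its parameters are hypothetical names, not part of seaborn or matplotlib, and the plain demo axis stands in for the FacetGrid axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def fade_other_lines(ax, keep=None, alpha=0.4):
    """Fade every line on `ax` except the one at index `keep`.

    Hypothetical helper: pass keep=None to fade all lines.
    """
    for idx, line in enumerate(ax.lines):
        if idx != keep:
            line.set_alpha(alpha)


# demo on a plain matplotlib axis holding three lines
fig, demo_ax = plt.subplots()
for offset in range(3):
    demo_ax.plot([0, 1, 2], [offset, offset + 1, offset + 2])

fade_other_lines(demo_ax, keep=0)
alphas = [line.get_alpha() for line in demo_ax.lines]
plt.close(fig)
```

Assuming the lines in each facet are drawn in the order Standard, Electric, Smart, the first row of the grid could then be handled with `for ax in g.axes[0]: fade_other_lines(ax, keep=0)`, and the Annual pass row with `keep=None`.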

Observations:

  • Standard bikes were the most popular choice among customers with a One Day pass during the first quarter of 2019. However, rentals on the One Day pass declined to the point where there was no significant difference in preference between Standard bikes and Smart bikes towards the end of the year.
  • Customers with a Monthly pass preferred Standard bikes during the first quarter of the year; however, Electric bikes gained more popularity over the rest of 2019.
  • As the number of bike rentals on the Annual pass is very low, the bike preference of customers with an Annual pass is not evaluated.

----------------------------------------------

3.2.10 Insights:

  1. The classification of bikes was introduced at the end of 2018. Hence the rentals with an unknown bike category are ignored and the analysis is limited to the year 2019.
  2. Rentals of Standard bikes decreased over 2019, while rentals of Smart and Electric bikes increased over the same period. Although Standard bikes were popular at the start of 2019, customers preferred Smart and Electric bikes towards the end of the year. Hence it can be concluded that the launch of Electric and Smart bikes was a success.
  3. Even though Standard bikes were the most popular choice during the first quarter of the year, Electric bikes gradually gained popularity among One Way trips over the rest of the year.
  4. Customers who take Round Trips do not show a preference for any bike type.
  5. Standard bikes were the most popular choice among customers with a One Day pass during the first quarter of 2019. However, rentals on the One Day pass declined to the point where there was no significant difference in preference between Standard bikes and Smart bikes towards the end of the year.
  6. Customers with a Monthly pass preferred Standard bikes during the first quarter of the year; however, Electric bikes gained more popularity over the rest of 2019.

3.2.11 Reforms proposed:

  1. Even though Smart bikes were introduced along with Electric bikes, they failed to gain as much popularity. Hence discounts should be announced to increase the rental activity of Smart bikes during peak hours, which in turn helps the stations maintain the availability of other bike types.
  2. Use Smart bikes in promotional events such as bike rallies to familiarize customers with their features and encourage customers to choose Smart bikes in the future.

3.3 Is the Monthly pass the most subscribed pass type among customers?

  • Column: pass_type
  • Data type: categorical data, nominal
  • Plot : Bar chart, Point plot , Facet grid

3.3.1 Aggregated distribution of bike rentals based over pass type:

In [150]:
# Assign color palette as per requirement
sb.set_style("white")
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[6]

# prepare data for the plot
pass_type_order = bikeshare.pass_type.value_counts().index
max_count = bikeshare['pass_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]

# Seaborn's countplot
ax = sb.countplot(data = bikeshare, x = 'pass_type', color = base_color, alpha= 1, 
                  order = pass_type_order, saturation = 0.5)

# improve plot aesthetics
plt.title('Aggregated bike rentals based on customer pass\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
pass_type_counts = bikeshare['pass_type'].value_counts()

# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = pass_type_counts[label.get_text()]

    except KeyError:
        count = 0
    
    count_percent = 100*count/n_points
    if count_percent < 1:
        pct_string = '< 1%'
    else:
        pct_string = '{:0.0f}%'.format(count_percent)

    # print the annotation depending on the bar length
    if count < (n_points/20):
        plt.text(loc, count + (n_points/30), pct_string, ha = 'center', color = 'black', fontsize = 12)
    else:
        plt.text(loc, count - (n_points/25), pct_string, ha = 'center', color = 'black', fontsize = 12);
# -------------------------------------------------------


# loops over the different bars in each figure and
# fade out the bars other than highest rental pass
for i, p in enumerate(ax.patches):
    if i != 0:
        p.set_alpha(0.6);

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.3.1 Aggregated distribution of bike rentals based over pass type.png', dpi=300, bbox_inches='tight')

The above plot shows that the Monthly pass is the most popular subscription among customers. However, the calculation is based on bike rentals aggregated over 3 years, so a deeper analysis segmented by year is required to reveal any hidden insights.
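
The percentage labels on the bars follow a simple rule: very small shares get a "less than" label instead of a rounded number. A minimal sketch of that rule is given here. `pct_label` and its `floor` parameter are hypothetical names, and the total of 808589 rentals is an assumption taken from the row count that `describe()` reports later in this notebook.

```python
def pct_label(count, total, floor=1.0):
    """Return a percentage label for a bar annotation.

    Shares below `floor` percent are shown as '< floor%' rather than a
    rounded value that would misleadingly read '0%'.
    """
    share = 100 * count / total
    if share < floor:
        return '< {:0.0f}%'.format(floor)
    return '{:0.0f}%'.format(share)


# counts taken from the yearly rentals table in this notebook;
# the total is assumed to equal the 808589 rows counted by describe()
total_rentals = 808589
flex_label = pct_label(263, total_rentals)
monthly_label = pct_label(143044 + 161060 + 171562, total_rentals)
```

Keeping the threshold and the label text in one place avoids the case where a share is too large for the "less than" label but still rounds down to 0%.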

3.3.2 Aggregated yearly rentals based on pass type:

In [151]:
# create a dataset for bike rentals over the years by pass type
categorical_counts = bikeshare.groupby([bikeshare['pass_type'], 
                                        bikeshare['year']]).size().reset_index(name='rentals')
categorical_counts.head(10)
Out[151]:
pass_type year rentals
0 Walk-up 2017 65938
1 Walk-up 2018 46140
2 One Day 2017 5412
3 One Day 2018 89595
4 One Day 2019 76185
5 Monthly 2017 143044
6 Monthly 2018 161060
7 Monthly 2019 171562
8 Flex 2018 263
9 Annual 2017 10937

Line plot:

In [152]:
# set the palette as per requirement
sb.set_style('whitegrid')
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui, n_colors=5, desat=0.6)

plt.figure(figsize = [6, 4])
max_count = categorical_counts.rentals.max()
y_tick_values = np.arange(0, max_count + 50000, 50000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]

# plot line plot
ax = sb.lineplot(data=categorical_counts, x = "year", y = "rentals", hue="pass_type", linewidth=3, alpha = 0.8, 
                 style="pass_type", err_style="bars", markers = ['o', 'o', 'o', 'o', 'o'], markersize=10)
# draw all five pass type lines as solid lines
for line in ax.lines[:5]:
    line.set_linestyle("-")

plt.title('Aggregated yearly rentals based on pass type\n', weight = 'bold', fontsize = 16)
plt.yticks(y_tick_values, y_tick_names, fontsize=12)
plt.xlabel('\nYear', fontsize=14)
plt.ylabel('Rentals (Thousands)\n', fontsize=14)
# set custom xticks to avoid segmentation of continuous values of year (['2017.25', '2017.50', '2017.75, ....'])
# get alloted xticks, if decimal part equals to 0, set it as xtick value else skip the xtick value
xlabels = ['{:.0f}'.format(x) if divmod(x, 1)[1] == 0 else "" for x in ax.get_xticks()]
ax.set_xticklabels(xlabels)
plt.xticks(fontsize=12)

# customize legend
leg = ax.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
          borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,  
          title='Pass Type', title_fontsize=12, fontsize=10, facecolor='white', markerfirst=True, 
          handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.35, 1))

leg_lines = leg.get_lines()
leg_lines[1].set_linestyle("-")
leg_lines[2].set_linestyle("-")
leg_lines[3].set_linestyle("-")
leg_lines[4].set_linestyle("-")
leg.texts[0].set_text("");

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.3.2 Aggregated yearly rentals based on pass type.png', dpi=300, bbox_inches='tight')
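
The x tick cleanup in the cell above keeps only whole-number ticks so that years are not shown as 2017.25, 2017.50 and so on. The same rule is sketched here as a small standalone function; `whole_year_labels` is a hypothetical name and the tick positions are example values, not the ones matplotlib actually chose for the plot.

```python
def whole_year_labels(tick_values):
    """Label whole-number ticks as years and blank out fractional ones."""
    return ['{:.0f}'.format(t) if float(t).is_integer() else ''
            for t in tick_values]


# example tick positions such as matplotlib might pick for a 2017-2019 axis
example_ticks = [2017.0, 2017.5, 2018.0, 2018.5, 2019.0]
year_labels = whole_year_labels(example_ticks)
```

An alternative, if the set of years is known in advance, is to fix the tick positions directly with `ax.set_xticks([2017, 2018, 2019])`, which removes the need to blank out labels.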

Observations:

  • The above plot shows that the Monthly pass has always been the most popular choice among customers. The discontinuation of the Walk-up pass in 2019 further increased the number of bike rentals on the Monthly pass.
  • The Flex pass is an experimental introduction with too few bike rentals to include in further analysis.
  • There is a slight increase in rentals on the Annual pass in the year 2019.
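
The first observation can be cross-checked against the table itself. The sketch below rebuilds only the ten rows displayed in Out[151] (the remaining rows, including the later Annual pass counts, are not shown, so they are left out) and finds the leading pass type per year. The variable names are illustrative.

```python
import pandas as pd

# the ten rows displayed in Out[151]; later rows of the table are not shown,
# so this sketch is incomplete for the Annual pass
rows = [
    ('Walk-up', 2017, 65938), ('Walk-up', 2018, 46140),
    ('One Day', 2017, 5412), ('One Day', 2018, 89595), ('One Day', 2019, 76185),
    ('Monthly', 2017, 143044), ('Monthly', 2018, 161060), ('Monthly', 2019, 171562),
    ('Flex', 2018, 263), ('Annual', 2017, 10937),
]
displayed = pd.DataFrame(rows, columns=['pass_type', 'year', 'rentals'])

# one row per year, one column per pass type
by_year = displayed.pivot(index='year', columns='pass_type', values='rentals')

# pass type with the most rentals in each year (missing cells are skipped)
top_pass = by_year.idxmax(axis=1)

# year-over-year change in Monthly rentals
monthly_change = by_year['Monthly'].diff().dropna()
```

Among the displayed rows, Monthly leads in every year, and the `diff()` step gives the size of each yearly increase.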

Let us take a look at the other factors that influence the pass type over time:

3.3.3 Average hourly rentals based on pass type and trip type:

In [153]:
# create a dataset for bike rentals over each hour in a day by pass type and trip type
hours_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["day"],
                              bikeshare["hour"],
                              bikeshare["trip_type"],
                              bikeshare["pass_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Out[153]:
year month day hour trip_type pass_type rentals
0 2017 1 1 0 One Way Walk-up 3.0
1 2017 1 1 0 One Way One Day NaN
2 2017 1 1 0 One Way Monthly 3.0
3 2017 1 1 0 One Way Flex NaN
4 2017 1 1 0 One Way Annual NaN
5 2017 1 1 0 Round Trip Walk-up 3.0
6 2017 1 1 0 Round Trip One Day NaN
7 2017 1 1 0 Round Trip Monthly NaN
8 2017 1 1 0 Round Trip Flex NaN
9 2017 1 1 0 Round Trip Annual NaN
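
The NaN rows above mark hour and pass type combinations with no recorded rentals. Whether they stay as NaN or are filled with 0 changes what "average hourly rentals" means: pandas skips NaN when averaging, and seaborn's point plot is expected to drop missing values as well, so the average is taken over active hours only. A small sketch with made-up numbers illustrates the difference.

```python
import numpy as np
import pandas as pd

# hypothetical rentals for one pass type over four hours;
# NaN marks hours with no recorded rentals, as in Out[153]
rentals = pd.Series([3.0, np.nan, 1.0, np.nan])

# averages only the 2 hours that have a value
mean_skipping_nan = rentals.mean()

# averages over all 4 hours, counting empty hours as zero rentals
mean_with_zeros = rentals.fillna(0).mean()
```

Filling with 0 before plotting would lower every line in the facet grid, most of all for rarely used passes, so the choice should match the question being asked.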

Facet grid:

In [154]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui, n_colors=5, desat=None)

# Facet grid with point plot
plot_order = hours_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'trip_type', height = 3, aspect = 1, hue = 'pass_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average hourly rentals based on pass type and trip type', fontsize = 16, weight = 'bold')
g.set_titles('Trip = {row_name} | Year = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nMonth of the year', size = 13)
g.set_ylabels('Avg. bike rentals\n', size = 13)
g.set_yticklabels(size = 10)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    # set new labels
    ax.set_xticklabels(labels, size = 10)
    
ax1 = g.facet_axis(0,1)
ax2 = g.facet_axis(0,2)
ax3 = g.facet_axis(1,1)
ax4 = g.facet_axis(1,2)
for ax in [ax1, ax2, ax3, ax4]:
    sb.despine(left = True, ax = ax)
    
for i, ax in enumerate(g.axes.flat):
    if i in [0, 1, 2]:
        ax.lines[0].set_alpha(0.3)
        ax.lines[1].set_alpha(0.3)
        ax.lines[3].set_alpha(0.3)
        ax.lines[4].set_alpha(0.3)
    if i in [3, 4, 5]:
        ax.lines[0].set_alpha(0.3)
        ax.lines[2].set_alpha(0.3)
        ax.lines[3].set_alpha(0.3)
        ax.lines[4].set_alpha(0.3)
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[3], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[4], linestyle='-', linewidth = 2)]
labels = hours_df.pass_type.sort_values(ascending=True).unique()
plt.legend(custom, labels, scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Pass Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.6, 2.4));  
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.3.3 Average hourly rentals based on pass type and trip type.png', dpi=300, bbox_inches='tight')

Observations:

  • The majority of One Way trips are taken by customers with a Monthly subscription.
  • One Way rentals on the One Day pass decreased steadily over 2018 and 2019, which might be the reason for the increase in Monthly subscribers in the second half of 2019.
  • The majority of Round Trips are taken by customers with a One Day subscription.

----------------------------------------------

3.3.4 Insights:

  1. The Monthly pass has always been the most popular choice among customers. The discontinuation of the Walk-up pass in 2019 further increased the number of bike rentals on the Monthly pass.
  2. There is a slight increase in rentals on the Annual pass in the year 2019.
  3. The majority of One Way trips are taken by customers with a Monthly subscription.
  4. One Way rentals on the One Day pass decreased steadily over 2018 and 2019, which might be the reason for the increase in Monthly subscribers in the second half of 2019.
  5. The majority of Round Trips are taken by customers with a One Day subscription.

3.3.5 Reforms proposed:

  1. Discounts should be announced on One Day subscription to encourage tourists and non-subscribers to rent a bike.

3.4 Do the majority of customers use the Base fare option to reach their destinations? If yes, what percentage of bike rentals generate extra income in the form of Extended fares?

  • Column: fare_type, fare
  • Data type: (categorical data, nominal), (numerical, continuous)
  • Plot : Bar chart, Count plot, Facet grid, Point plot

3.4.1 Aggregated bike rentals based on fare type:

In [155]:
# Assign color palette as per requirement
sb.set_style("white")
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[8]

# prepare data for the plot
fare_type_order = bikeshare.fare_type.value_counts().index
max_count = bikeshare['fare_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]

# Seaborn's countplot
sb.countplot(data = bikeshare, x = 'fare_type', color = base_color, alpha= 0.6, 
             order = fare_type_order, saturation = 1)

# improve plot aesthetics
plt.title('Aggregated bike rentals based on fare type\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
fare_type_counts = bikeshare['fare_type'].value_counts()
# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = fare_type_counts[label.get_text()]

    except KeyError:
        count = 0
    
    pct_string = '{:0.0f}%'.format(100*count/n_points)

    # print the annotation depending on the bar length
    if count < (n_points/10):
        plt.text(loc, count + (n_points/20), pct_string, ha = 'center', color = 'black', fontsize = 14)
    else:
        plt.text(loc, count - (n_points/10), pct_string, ha = 'center', color = 'black', fontsize = 14);
# -------------------------------------------------------

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.4.1 Aggregated bike rentals based on fare type.png', dpi=300, bbox_inches='tight')

Observations:

  • The above plot shows that the majority of customers use the Base fare option to reach their destinations.
  • A decrease in the percentage of Extended fares results in a decrease in income. As the percentage of Extended fares is less than 20%, business reforms or promotional programs are needed to encourage customers to ride bikes for longer durations.

However, not all Base fares are free for the first 30 minutes. Unlike other pass types, the Walk-up pass charges a fare of 1 dollar for the Base fare type.

3.4.2 Calculation of Income distribution of trip fares:

In [156]:
# compute the descriptive statistics of trip fares
bikeshare['fare'].describe()
Out[156]:
count    808589.000000
mean          1.203841
std           6.951361
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max         540.750000
Name: fare, dtype: float64

Break down the trip fares into customized sections based on the descriptive statistics of trip fares.

In [157]:
# divide the fare into customized sections
# named `bins` to avoid shadowing Python's built-in bin()
bins = [-1,0,5,10,50,100,600]
# use pd.cut to assign each fare to its bin
fare = pd.cut(bikeshare['fare'], bins)
fare = fare.to_frame()
fare.columns = ['fare_sections']
fare.sample(10)
Out[157]:
fare_sections
681590 (-1, 0]
262812 (5, 10]
239450 (-1, 0]
801531 (-1, 0]
301773 (-1, 0]
808082 (-1, 0]
803646 (-1, 0]
589207 (-1, 0]
590474 (-1, 0]
112248 (-1, 0]
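
The intervals produced by `pd.cut` are closed on the right, so `(-1, 0]` holds exactly the free rides and `(0, 5]` holds fares above 0 and up to 5 dollars. The sketch below shows the binning on a handful of hypothetical fares (not values from the dataset) and how the per-bin share can be read off in bin order.

```python
import pandas as pd

# hypothetical fares in dollars, chosen to touch every bin
sample_fares = pd.Series([0, 0, 0, 1, 1.75, 3.5, 7, 12, 60, 540.75])
bins = [-1, 0, 5, 10, 50, 100, 600]

# right-closed intervals: (-1, 0] holds the free rides, (0, 5] fares up to 5 dollars, ...
sections = pd.cut(sample_fares, bins)

# sort=False keeps the counts in bin order rather than ordering by frequency
section_counts = sections.value_counts(sort=False)

# share of rides in each bin, and share of rides with any fare at all
section_share = 100 * section_counts / len(sample_fares)
paid_share = 100 * (sample_fares > 0).mean()
```

Because the counts come back in bin order, they line up one-to-one with the bars of a count plot drawn on the same column, which avoids parsing the tick label text to find each bar's count.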

Count plot:

In [158]:
# Assign palette as per requirement
sb.set_palette('colorblind', n_colors=10, desat = 0.8)
base_color = sb.color_palette()[8]

# Seaborn's count plot
sb.countplot(data = fare, x = 'fare_sections', color = base_color, alpha= 0.8, saturation = 1)


# improve plot aesthetics
# -------------------------------------------------------
plt.title('Income distribution of trip fares\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare (Dollars)', fontsize = 14)
plt.ylabel('Rentals (million)\n', fontsize = 14)
# obtain y_ticks and convert them to a multiple of millions
y_tick_locs = []
locs, labels = plt.yticks()

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    y_tick_locs.append(int(loc))
    
y_tick_names = ['{:0.1f} M'.format(loc/1000000) for loc in y_tick_locs]
plt.yticks(y_tick_locs, y_tick_names, fontsize=12)
# assigning xticks here will interfere with annotations
# -------------------------------------------------------


# add annotations
# -------------------------------------------------------
n_points = fare.shape[0]
fare_counts = fare.fare_sections.value_counts()
fare_max = fare_counts.max()
# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # the tick label is an interval such as '(5, 10]'; take its right edge
    right_edge = int(label.get_text().rstrip(']').split(',')[-1])
    # find the bin that contains the right edge and use its count
    count = 0
    for interval, interval_count in fare_counts.items():
        if right_edge in interval:
            count = interval_count
            break

    if (100*count/n_points) < 0.1:
        pct_string = '< 0.1%'
    else:
        pct_string = '{:0.1f}%'.format(100*count/n_points)

    # print the annotation depending on the bar length
    if count < (fare_max/10):
        plt.text(loc, count+(fare_max/25), pct_string, ha = 'center', color = 'black', weight = 'normal', fontsize = 12)
    else:
        plt.text(loc, count-(fare_max/10), pct_string, ha = 'center', color = 'black', weight = 'normal', fontsize = 12)
# -------------------------------------------------------
    
    
# get xticks and change the first categorical expression to just zero dollars
x_labels_new = ['[0]']
# get the current tick locations and labels
x_locs, x_labels = plt.xticks()
for x_label in x_labels[1:]:
    x_labels_new.append(x_label.get_text())
plt.xticks(x_locs, x_labels_new, fontsize=12)

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.4.2 Income distribution of trip fares.png', dpi=300, bbox_inches='tight')

Observations:

  • It is evident that the majority of customers use the Base fare option to reach their destinations.
  • Around 25.5% of the bike rentals generate extra income in the form of Extended fares, which reflects a healthy business model.

3.4.3 Average monthly rentals over years by fare type:

In [159]:
# create a dataset for monthly rentals over years by fare type
month_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["fare_type"]]).count()['trip_id'].reset_index(name='rentals')
month_df.head(10)
Out[159]:
year month fare_type rentals
0 2017 1 Base 8925
1 2017 1 Extended 1231
2 2017 2 Base 8450
3 2017 2 Extended 918
4 2017 3 Base 12142
5 2017 3 Extended 1532
6 2017 4 Base 12110
7 2017 4 Extended 1525
8 2017 5 Base 17219
9 2017 5 Extended 1805
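
The monthly counts can also be turned into an Extended fare share per month, which relates directly to the question posed in this section. The sketch below uses only the ten rows displayed in Out[159] (January to May 2017); the variable names are illustrative.

```python
import pandas as pd

# the ten rows displayed in Out[159] (January to May 2017)
rows = [
    (2017, 1, 'Base', 8925), (2017, 1, 'Extended', 1231),
    (2017, 2, 'Base', 8450), (2017, 2, 'Extended', 918),
    (2017, 3, 'Base', 12142), (2017, 3, 'Extended', 1532),
    (2017, 4, 'Base', 12110), (2017, 4, 'Extended', 1525),
    (2017, 5, 'Base', 17219), (2017, 5, 'Extended', 1805),
]
shown = pd.DataFrame(rows, columns=['year', 'month', 'fare_type', 'rentals'])

# one row per (year, month), one column per fare type
wide = shown.pivot_table(index=['year', 'month'], columns='fare_type',
                         values='rentals', aggfunc='sum')

# percentage of rentals in each month that ran into Extended fare
extended_share = 100 * wide['Extended'] / (wide['Base'] + wide['Extended'])
```

For these five months the share stays between roughly 9% and 12%. Running the same steps on the full `month_df` would give the trend for all three years.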

Facet Grid:

In [160]:
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)

# Seaborn's point plot
plot_order = month_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = month_df, col = 'fare_type', col_wrap = 2, height = 4, aspect = 1.2, hue = 'year')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None);

g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average monthly rentals over years by fare type', fontsize = 16, weight = 'bold')
g.set_titles('{col_name}', weight = 'bold', size = 14, color = 'dimgrey')

# improve plot aesthetics
y_tick_names = []
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new_value = '{:0.0f} K'.format(y_label_value/1000)
        y_tick_names.append(y_label_new_value)
g.set_xticklabels(plot_order, size = 12)
g.set_yticklabels(y_tick_names, size = 12)
g.set_xlabels('\nMonth of the year', size = 14)
g.set_ylabels('Avg. Bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.2);

# add custom legend
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Year', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1.1));

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.4.3 Average monthly rentals over years by fare type.png', dpi=300, bbox_inches='tight')

Observations:

  • The above plot shows that, even though Base fare is the most popular choice among customers, the average number of bike rentals during the first six months of 2019 is very low.
  • Rentals with the Extended fare type also decreased in 2019 compared to the previous year.

----------------------------------------------

3.4.4 Insights:

  1. The majority of customers use the Base fare option to reach their destinations. However, 2019 saw relatively fewer bike rentals in the first and second quarters than in the third and fourth quarters. Reforms must be taken to increase bike rentals in the first half of the year.
  2. Around 25.5% of the bike rentals generated extra income in the form of Extended fares, which portrays a good business model. However, the average number of bike rentals with Extended fare in 2019 is lower than in 2018 and needs to be increased by adopting new rental techniques that encourage customers to ride the bikes for longer durations.

3.4.5 Reforms proposed:

  1. Discounts/promotions should be announced to encourage customers to ride bikes for longer durations.

3.5 Comparison of bike rentals based on various categorical parameters:

  • Columns: trip_type, bike_type, pass_type, fare_type
  • Data type: categorical data, nominal
  • Plot : Bar chart

Bar Chart:

In [161]:
def count_subplot(subplot, color, cat_type, alpha, sat):
    # plot the distribution of bike rentals based on category types
    #-----------------------Start of subplot-----------------------
    
    # prepare the data for the plot
    sb.set_style('darkgrid')
    base_color = sb.color_palette()[color]
    plt.subplot(1, 4, subplot)
    max_count = bikeshare.shape[0]
    y_tick_values = np.arange(0, max_count + 100000, 100000)
    y_tick_names = ['{:0.1f} M'.format(v/1000000) for v in y_tick_values]
    cat_order = bikeshare[cat_type].value_counts().index
    
    # plot countplot
    sb.countplot(data = bikeshare, x = cat_type, color = base_color, alpha= alpha, order = cat_order, saturation = sat)
    
    # improve plot aesthetics
    plt.title('Rentals based on {} type'.format(cat_type[0: 4].title()), fontsize = 16, weight = 'bold')
    plt.xlabel('\n{} type'.format(cat_type[0: 4].title()), fontsize = 14)
    plt.xticks(fontsize = 12)
    if subplot == 1:
        plt.ylabel('Rentals (million)\n', fontsize = 14)
        plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
    else:
        plt.ylabel('')
        plt.yticks(y_tick_values, [])

    # add annotations
    # -------------------------------------------------------
    n_points = bikeshare.shape[0]
    cat_type_counts = bikeshare[cat_type].value_counts()
    # get the current tick locations and labels
    locs, labels = plt.xticks()

    # loop through each pair of locations and labels
    for loc, label in zip(locs, labels):
        try:
            # get the text property for the label to get the correct count
            count = cat_type_counts[label.get_text()]

        except KeyError:
            count = 0
            
        pct_string = '{:0.0f}%'.format(100*count/n_points)

        # print the annotation depending on the bar length
        if count < (n_points/10):
            plt.text(loc, count + (n_points/25), pct_string, ha = 'center', color = 'black', fontsize = 13)
        else:
            plt.text(loc, count - (n_points/15), pct_string, ha = 'center', color = 'black', fontsize = 13);
    # -------------------------------------------------------
    #-------------------------End of subplot------------------------


# Assign color palette and figure size as per requirement
plt.figure(figsize = [20, 6])
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[0]

# plot syntax : count_subplot(subplot, color, cat_type, alpha, sat)
count_subplot(subplot=1, color=0, cat_type='trip_type', alpha=0.5, sat=1)
count_subplot(subplot=2, color=2, cat_type='bike_type', alpha=0.5, sat=1)
count_subplot(subplot=3, color=6, cat_type='pass_type', alpha=0.6, sat=0.8)
count_subplot(subplot=4, color=8, cat_type='fare_type', alpha=0.6, sat=1)

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.5 Comparision of bike rentals based on various categorical parameters.png', dpi=300, bbox_inches='tight')

Insight:

  1. Based on the classification of aggregated bike rentals over various parameters, it can be concluded that most customers prefer Standard bikes over Smart bikes, take more One Way trips than Round Trips, and prefer the Monthly pass over other subscriptions.

3.6 Is the majority of the customer base comprised of working individuals?

  • Column: hour
  • Data type: continuous data
  • Plot : Distribution plot, Line plot

3.6.1 Aggregated Hourly distribution of bike rentals:

In [162]:
plt.figure(figsize = [8, 6])

# Assign palette and grid as per requirement
sb.set_style('darkgrid')

# prepare data for the plot
# count rentals per hour once and reuse the result for both axes
hourly_counts = bikeshare.groupby(bikeshare['hour']).count()['trip_id']
x = hourly_counts.index
y = hourly_counts.values
x_tick_values = np.arange(0,  23+1, 1)
x_tick_names = ['{:}'.format(v) for v in x_tick_values]
y_tick_values = np.arange(0,  bikeshare.hour.value_counts().max()+10000, 10000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]

# matplotlib's line plot
plt.plot(x, y, linewidth=2.0, color = 'lightskyblue')

# improve plot aesthetics
plt.title('Aggregated Hourly distribution of bike rentals', fontsize = 16, weight = 'bold')
plt.xlabel('\nHour of the day', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(x_tick_values, x_tick_names, fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# fill the area under the line
plt.fill_between(x, y, color = 'lightskyblue')

# draw the vertical axial line at the peak hour
peak_hour = bikeshare['hour'].value_counts(ascending=False).index[0]
plt.axvline(peak_hour, color='black', alpha=0.3, linewidth=2);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.1 Aggregated distribution of bike rentals based on hour of the day.png', dpi=300, bbox_inches='tight')

The above plot shows that the busiest hours are in the evening. The vertical line marks the hour with the maximum aggregated bike rentals, which is 5:00 PM. Let us look at the average number of bike rentals for each hour of the day for a clearer interpretation of the trends.

3.6.2 Average bike rentals based on the hour of the day:

In [163]:
# create a dataset for bike rentals over each hour in a day
hours_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["day"],
                              bikeshare["hour"]]).count()['trip_id'].reset_index(name='rentals')

hours_df['rentals'] = hours_df['rentals'].fillna(0).astype(int)
hours_df.head(10)
Out[163]:
year month day hour rentals
0 2017 1 1 0 9
1 2017 1 1 1 5
2 2017 1 1 2 8
3 2017 1 1 3 2
4 2017 1 1 4 1
5 2017 1 1 5 2
6 2017 1 1 6 1
7 2017 1 1 7 1
8 2017 1 1 8 4
9 2017 1 1 9 5

Point plot:

In [164]:
# Assign color palette and figure size as per requirement
plt.figure(figsize=[12,4])
sb.set_style('white')

# Seaborn's point plot
sb.pointplot(data = hours_df, x = "hour", y = "rentals", linestyles = "-", color = 'lightskyblue')

# improve plot aesthetics
plt.title('Average bike rentals based on hour of the day\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. rentals\n', fontsize = 14)
plt.xlabel('\nHour of the day', fontsize = 14)
plt.xticks(fontsize = 12)

# get ytick locs and rearrange them to start from zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0,  max_count+10, 10)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
avg_rental_counts = hours_df.groupby([hours_df["hour"]]).mean()['rentals']
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (count > ((avg_rental_max*3)/5)) else 'limegreen' for count in avg_rental_counts ]

# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label, avg_rental_count, clr in zip(locs, labels, avg_rental_counts, clrs):
    try:
        count = avg_rental_count
    except KeyError:
        count = 0   
    pct_string = '{:0.0f}'.format(count)
    # print the annotation depending on the bar length
    plt.text(loc, count + int(avg_rental_max/10), pct_string, ha = 'center', color = 'black', 
             fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------

sb.despine(top=True, right=True, left=False, bottom=False);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.2 Average bike rentals based on hour of the day.png', dpi=300, bbox_inches='tight')

Observations:

  • The bike rentals aggregated by hour of the day show that rentals start to rise at 6:00 AM and keep increasing until 5:00 PM, with peaks at 8:00 AM, 12:00 PM, and 5:00 PM. These peaks correspond to the morning commute, the lunch break, and the end of the working day respectively. This suggests that a large portion of the customer base consists of working individuals who use the bikes for transportation.
  • The average bike rentals by hour of the day show that rentals are lowest during the night and the early hours.
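
The two-step averaging behind the point plot (count the trips per day and hour, then average those counts per hour) can be sketched on a toy dataset. This is only an illustration: the `trips` frame and its values are hypothetical, and only the column names `trip_id`, `day`, and `hour` are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per trip, with column names taken from the notebook
trips = pd.DataFrame({
    "trip_id": range(1, 9),
    "day":  [1, 1, 1, 1, 2, 2, 2, 2],
    "hour": [8, 8, 17, 17, 8, 17, 17, 17],
})

# Step 1: count the trips for each (day, hour) combination
per_day_hour = trips.groupby(["day", "hour"]).size().reset_index(name="rentals")

# Step 2: average those counts for each hour of the day
avg_by_hour = per_day_hour.groupby("hour")["rentals"].mean()
print(avg_by_hour)
```

Note that an hour with no trips on a given day produces no row in step 1, so that day is left out of the mean for that hour instead of being counted as zero.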

3.6.3 Average rentals based on the weekday over individual years:

In [165]:
# create a dataset for bike rentals over the days in a week
weekday_df = bikeshare.groupby([bikeshare['year'], 
                                bikeshare['month'],
                                bikeshare['week'],
                                bikeshare['weekday']]).count()['trip_id'].reset_index(name='rentals')

weekday_df.head(10)
Out[165]:
year month week weekday rentals
0 2017 1 First Monday 259.0
1 2017 1 First Tuesday 327.0
2 2017 1 First Wednesday 350.0
3 2017 1 First Thursday 231.0
4 2017 1 First Friday 361.0
5 2017 1 First Saturday 277.0
6 2017 1 First Sunday 270.0
7 2017 1 Second Monday 306.0
8 2017 1 Second Tuesday 245.0
9 2017 1 Second Wednesday 315.0
In [166]:
# Assign color palette and figure size as per requirement
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)

# Seaborn's point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-", 
             hue = 'year', ci = None, order = plot_order)

# improve plot aesthetics
plt.title('Average bike rentals based on weekday of the week\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)

# get ytick locs and rearrange them to start from zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0,  max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.8), loc = 6, labelspacing=0.5,  
           title='Year', title_fontsize=12, fontsize=10, facecolor='white', markerfirst=True, 
           handlelength=2, handletextpad=0.5)

# draw horizontal reference lines
plt.axhline(500, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(600, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(700, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(800, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(900, color='grey', alpha=1, linewidth=0.5, linestyle='--')
sb.despine(top=True, right=True, left=False, bottom=False);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.3 Average bike rentals based on day of the week over years.png', dpi=300, bbox_inches='tight')

Observations:

  • The plot above shows that bike rentals decrease on non-working days (Saturday and Sunday). This reinforces the argument that the majority of the customer base consists of working individuals.
  • However, 2019 shows a steep decrease in bike rentals on non-working days compared with previous years. This suggests a failure to attract tourists and non-subscribers to ride over the weekend.
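
The size of the weekend drop can be put into numbers by comparing the mean rentals on weekend days with the mean on working days for each year. The sketch below assumes a frame shaped like `weekday_df`; the `weekday_counts` frame, its values, and the `part` and `weekend_drop_pct` columns are hypothetical and are not results from the bikeshare data.

```python
import numpy as np
import pandas as pd

# Hypothetical per-week counts, shaped like weekday_df in the notebook
weekday_counts = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "weekday": ["Friday", "Saturday", "Friday", "Sunday"] * 2,
    "rentals": [800, 700, 820, 720, 900, 500, 880, 480],
})

# Label each row as a weekend day or a working day
is_weekend = weekday_counts["weekday"].isin(["Saturday", "Sunday"])
labelled = weekday_counts.assign(part=np.where(is_weekend, "weekend", "working"))

# Mean rentals per year for each part of the week, one column per part
summary = labelled.groupby(["year", "part"])["rentals"].mean().unstack("part")

# Relative drop of the weekend mean against the working-day mean, in percent
summary["weekend_drop_pct"] = 100 * (1 - summary["weekend"] / summary["working"])
print(summary.round(1))
```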

3.6.4 Average bike rentals based on hour of the day over years by trip type and pass type:

In [167]:
# create a dataset for bike rentals over each hour in a day by trip type and pass type
hours_df = bikeshare.groupby([bikeshare["year"], 
                              bikeshare["month"],
                              bikeshare["day"],
                              bikeshare["hour"],
                              bikeshare["trip_type"],
                              bikeshare["pass_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Out[167]:
year month day hour trip_type pass_type rentals
0 2017 1 1 0 One Way Walk-up 3.0
1 2017 1 1 0 One Way One Day NaN
2 2017 1 1 0 One Way Monthly 3.0
3 2017 1 1 0 One Way Flex NaN
4 2017 1 1 0 One Way Annual NaN
5 2017 1 1 0 Round Trip Walk-up 3.0
6 2017 1 1 0 Round Trip One Day NaN
7 2017 1 1 0 Round Trip Monthly NaN
8 2017 1 1 0 Round Trip Flex NaN
9 2017 1 1 0 Round Trip Annual NaN
In [168]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)

# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'pass_type', row = 'year', height = 4, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type and pass type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Year = {row_name} | Pass = {col_name}', weight = 'bold', size = 14, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
     # get x labels
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        # skip labels
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]

plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 3, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(-1.20, 4.1));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.4 Average bike rentals based on hour of the day over years by trip type and pass type.png', dpi=300, bbox_inches='tight')

Observation:

The plot above shows that the average number of bike rentals by non-subscribers and tourists (Walk-up or One Day pass) is lower than the number of rentals by working individuals (Monthly pass). This reinforces the argument that the majority of the customer base is composed of working individuals.

----------------------------------------------

3.6.5 Insights:

  1. The bike rentals aggregated by hour of the day show that rentals start to rise at 6:00 AM and keep increasing until 5:00 PM, with peaks at 8:00 AM, 12:00 PM, and 5:00 PM. These peaks correspond to the morning commute, the lunch break, and the end of the working day respectively. This suggests that a large portion of the customer base consists of working individuals who use the bikes for transportation.
  2. The average bike rentals by hour of the day show that rentals are lowest during the night and the early hours.
  3. Bike rentals decrease on non-working days (Saturday and Sunday). This reinforces the argument that the majority of the customer base consists of working individuals.
  4. The average number of bike rentals by non-subscribers and tourists (Walk-up or One Day pass) is lower than the number of rentals by working individuals (Monthly pass). This reinforces the argument that the majority of the customer base is composed of working individuals.

3.6.6 Reforms proposed:

  1. 2019 shows a steep decrease in bike rentals on non-working days compared with previous years. This suggests a failure to attract tourists and non-subscribers to ride over the weekend. Promotions aimed at tourists and non-subscribers should be announced to encourage them to rent a bike.
  2. Encouraging working individuals to ride on non-working days would increase revenue.

3.7 How can we increase the bike rentals based on hour of the day?

  • Column: hour
  • Data type: continuous data
  • Plot : Distribution plot, Line plot

3.7.1 Average bike rentals based on the time of the day:

In [169]:
# create a dataset for bike rentals for each daytime of the day
daytime_df = bikeshare.groupby([bikeshare['year'], 
                                bikeshare['month'], 
                                bikeshare['day'], 
                                bikeshare['daytime']]).count()['trip_id'].reset_index(name='rentals')

daytime_df['rentals'] = daytime_df['rentals'].fillna(0).astype(int)
daytime_df.head(10)
Out[169]:
year month day daytime rentals
0 2017 1 1 Early hours 27
1 2017 1 1 Morning 35
2 2017 1 1 Afternoon 143
3 2017 1 1 Evening 50
4 2017 1 1 Night 15
5 2017 1 2 Early hours 4
6 2017 1 2 Morning 50
7 2017 1 2 Afternoon 145
8 2017 1 2 Evening 44
9 2017 1 2 Night 16

Point plot:

In [170]:
# Assign color palette and grid as per requirement
sb.set_style('white')

# Seaborn's point plot
sb.pointplot(data = daytime_df, x = "daytime", y = "rentals", linestyles = "-", color = 'lightskyblue')

# improve plot aesthetics
plt.title('Avg. bike rentals based on daytime\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nSection of the day', fontsize = 14)
plt.xticks(fontsize = 12)

# get ytick locs and rearrange them to start from zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0,  max_count+50, 50)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
cat_order = daytime_df.daytime.sort_values(ascending=True).unique()
avg_rental_counts = daytime_df.groupby([daytime_df["daytime"]]).mean()['rentals'][cat_order]
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (count > ((avg_rental_max*4)/5)) else 'limegreen' for count in avg_rental_counts ]

# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label, avg_rental_count, clr in zip(locs, labels, avg_rental_counts, clrs):
    try:
        count = avg_rental_count
    except KeyError:
        count = 0   
    pct_string = '{:0.0f}'.format(count)
    # print the annotation depending on the bar length
    plt.text(loc, count + int(avg_rental_max/10), pct_string, ha = 'center', color = 'black', 
             fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------

sb.despine(top=True, right=True, left=False, bottom=False);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.7.1 Average bike rentals based on time of the day.png', dpi=300, bbox_inches='tight')

----------------------------------------------

3.7.2 Insights:

  1. Rental activity is highest in the Afternoon, with Morning and Evening closest behind. This shows that customers rent bikes mostly during the daytime. Rental activity is lowest during Early hours and at Night.
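
The `daytime` categories used above are presumably derived from `hour` during wrangling. A minimal sketch of one way to build such a column with `pd.cut` is given here; the bin edges are an assumption made for illustration, since the notebook does not state which boundaries were used.

```python
import pandas as pd

# Assumed bin edges in hours (left-inclusive); the notebook's own edges are not shown
edges = [0, 5, 12, 17, 21, 24]
names = ["Early hours", "Morning", "Afternoon", "Evening", "Night"]

# Hypothetical sample of hours, one falling into each bin
hours = pd.Series([3, 8, 13, 18, 22], name="hour")

# right=False makes each bin include its left edge, e.g. [5, 12) for Morning
daytime = pd.cut(hours, bins=edges, labels=names, right=False)
print(daytime.tolist())
```

Because `pd.cut` returns an ordered categorical, sorting the resulting column keeps the sections of the day in chronological order on the x-axis.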

3.7.3 Reforms proposed:

  1. Promoting fitness activities could increase rental activity during Early hours.
  2. Tie-ups with night events could boost Night-time rentals.

3.8 Do bike rentals decrease towards the end of the month?

  • Column: day
  • Data type: continuous data
  • Plot : Distribution plot, Line plot

3.8.1 Aggregated bike rentals based on the day of the month:

In [171]:
# Assign figure size and color palette as per requirement
plt.figure(figsize = [18, 6])
sb.set_style('darkgrid')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
clr = sb.color_palette()[4]

# prepare data for the plot
day_index_max = bikeshare.day.sort_values(ascending=False).unique()[0]
daily_order = np.arange(1,  day_index_max+1, 1)
max_count = bikeshare.day.value_counts().max()
min_count = bikeshare.day.value_counts().min()
tick_values = np.arange(0,  max_count+10000, 10000)
tick_names = ['{:0.0f} K'.format(v/1000) for v in tick_values]
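# align the counts with the plotting order (day 1..31) so each bar gets its own colour
day_values = bikeshare.day.value_counts().reindex(daily_order).values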
clrs = ['thistle' if (x > min_count) else clr for x in day_values]

# Seaborn's count plot
sb.countplot(data = bikeshare, x = 'day', palette=clrs, 
             alpha= 1, order = daily_order, saturation = 0.8)

# improve plot aesthetics
plt.title('Aggregative distribution of bike rentals based on day of the month', fontsize = 16, weight = 'bold')
plt.xlabel('\nDay of the month', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
daily_counts = bikeshare.day.value_counts()
daily_max = daily_counts.max()
# get the current tick locations and labels
locs, labels = plt.xticks()
    
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        count = daily_counts[int(label.get_text())] 
    except KeyError:
        count = 0   
    pct_string = '{:0.1f}%'.format(100*count/n_points)
    # print the annotation slightly above the bar
    plt.text(loc, count + (daily_max/40), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.8.1 Aggregated distribution of bike rentals based on day of the month.png', dpi=300, bbox_inches='tight')

The plot above suggests that rentals decrease towards the end of the month, especially on the 31st. However, the rentals are grouped by day of the month and summed over all 3 years rather than computed per month. There are only 21 occurrences of day 31, while most days occur 36 times over the 3 years (2017-2019); days 29 and 30 occur 33 times each because they are absent in February. This means that the rate of rentals on the 31st is not necessarily lower than on other days. Let us perform a more detailed analysis by calculating the average bike rentals by day of the month.
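
The occurrence counts quoted above (36, 33, and 21) can be checked independently of the dataset using only the standard library. The helper name `day_occurrences` is hypothetical and introduced just for this sketch.

```python
import calendar

def day_occurrences(day, years):
    """Count the months in `years` that contain the given day of the month."""
    return sum(
        1
        for year in years
        for month in range(1, 13)
        # monthrange returns (weekday of first day, number of days in the month)
        if calendar.monthrange(year, month)[1] >= day
    )

years = range(2017, 2020)  # 2017, 2018, 2019; none of them is a leap year
for day in (28, 29, 30, 31):
    print(day, day_occurrences(day, years))
```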

3.8.2 Average rentals based on the day of the month:

Create a dataset that contains the bike rentals for each day of the month in each year. Care should be taken not to include day 31 for every month of the year. Use only the combinations of year, month, and day that actually appear in the data, and do not fill NULL values with zeros, so that rentals on day 31 are counted only for the months that have a 31st.
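
The effect of zero-filling on the average can be seen in a small sketch. The `day31` series and its values are hypothetical; they only illustrate why missing months must stay missing for this particular average.

```python
import numpy as np
import pandas as pd

# Hypothetical rentals on day 31 for four consecutive months;
# NaN marks a month that has no 31st
day31 = pd.Series([800, np.nan, 760, np.nan], index=["Jan", "Feb", "Mar", "Apr"])

# Series.mean skips NaN by default, so only the existing days are averaged
mean_skipping_missing = day31.mean()

# Filling with zero treats the non-existent days as days with no rentals
mean_zero_filled = day31.fillna(0).mean()

print(mean_skipping_missing, mean_zero_filled)
```

With the missing months skipped the average stays at the level of the real observations, while zero-filling halves it in this example.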

In [172]:
# create a dataset for bike rentals over the days of the month
days_df = bikeshare.groupby([bikeshare["year"], 
                             bikeshare["month"],
                             bikeshare["day"]]).size().reset_index(name='rentals')

days_df.tail(10)
Out[172]:
year month day rentals
1085 2019 12 22 442
1086 2019 12 23 398
1087 2019 12 24 512
1088 2019 12 25 303
1089 2019 12 26 455
1090 2019 12 27 700
1091 2019 12 28 650
1092 2019 12 29 536
1093 2019 12 30 804
1094 2019 12 31 805

Check the appearances of individual days over the dataset created:

In [173]:
cat_order = days_df.day.sort_values(ascending=True).unique()
print('Day - Occurrences')
days_df.day.value_counts()[cat_order]
Day - Occurrences
Out[173]:
1     36
2     36
3     36
4     36
5     36
6     36
7     36
8     36
9     36
10    36
11    36
12    36
13    36
14    36
15    36
16    36
17    36
18    36
19    36
20    36
21    36
22    36
23    36
24    36
25    36
26    36
27    36
28    36
29    33
30    33
31    21
Name: day, dtype: int64

The cell above shows that days 29, 30, and 31 appear less often than the other days of the month, as expected. This confirms that the dataset is reliable for calculating the average bike rentals by day of the month.

Point plot:

In [174]:
# Assign grid and figure size as per requirement
plt.figure(figsize=[12,4])
sb.set_style('whitegrid')

# Seaborn's point plot
sb.pointplot(data = days_df, x = "day", y = "rentals", linestyles = "-", color = 'lightskyblue', ci=None)

# improve plot aesthetics
plt.title('Avg. bike rentals based on day of the month\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. rentals\n', fontsize = 14)
plt.xlabel('\nDay of the month', fontsize = 14)
sb.despine(top=True, right=True, left=True, bottom=False);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.8.2 Average bike rentals based on day of the month.png', dpi=300, bbox_inches='tight')

In contrast to the aggregated plot, the plot above shows that the days at the end of the month have relatively high average bike rentals compared with most other days. However, the y-axis of this plot does not start at zero, which amplifies the differences between the average rentals on any two days. Re-plot the graph with the y-axis starting at zero.

3.8.3 Average rentals based on the day of the month over Zero:

In [175]:
# Assign grid and figure size as per requirement
plt.figure(figsize=[12,4])
sb.set_style('white')

# Seaborn's point plot
sb.pointplot(data = days_df, x = "day", y = "rentals", linestyles = "-", color = 'lightskyblue', ci=None)

# improve plot aesthetics
plt.title('Avg. bike rentals based on day of the month\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the month', fontsize = 14)
plt.xticks(fontsize = 12)

# get ytick locs and rearrange them to start from zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0,  max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# draw horizontal reference lines
plt.axhline(700, color='black', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(800, color='black', alpha=1, linewidth=0.5, linestyle='--')

sb.despine(top=True, right=True, left=True, bottom=False);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.8.3 Average bike rentals based on day of the month.png', dpi=300, bbox_inches='tight')

----------------------------------------------

3.8.4 Insights:

  1. The bike rentals aggregated by day of the month suggest that rentals decrease slightly towards the end of the month. However, a deeper analysis using the average bike rentals shows that rental activity actually increases towards the end of the month.
  2. The average bike rentals by day of the month range only between 700 and 800. This shows that there is no significant difference in average bike rentals between any two days of the month.
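
How much a truncated y-axis exaggerates a 700 to 800 range can be expressed as the share of the axis height that the spread of the values takes up. This is a rough sketch: the helper `visible_share` and the `daily_avgs` values are hypothetical, and the axis is assumed to end exactly at the largest value.

```python
def visible_share(values, axis_min):
    """Fraction of the axis height (axis_min to max value) covered by the spread."""
    lo, hi = min(values), max(values)
    return (hi - lo) / (hi - axis_min)

# Hypothetical daily averages spanning the 700 to 800 range
daily_avgs = [700, 750, 800]

truncated = visible_share(daily_avgs, axis_min=700)  # axis starts at the minimum
zero_based = visible_share(daily_avgs, axis_min=0)   # axis starts at zero
print(truncated, zero_based)
```

On the truncated axis the spread fills the whole plot, while on the zero-based axis the same spread covers only one eighth of the height.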

3.9 Does a weekday have any effect on the bike rentals? If the effect is negative, propose any ideas to overcome the crisis?

  • Column: day of the week
  • Data type: categorical (ordinal) data
  • Plot : Distribution plot, Line plot

3.9.1 Aggregated bike rentals based on the day of the week:

In [176]:
# Assign figure size and color palette as per requirement
plt.figure(figsize = [8, 6])
sb.set_style('white')

# prepare data for the plot
day_order = bikeshare.weekday.value_counts().index
max_count = bikeshare.weekday.value_counts().max()
min_count = bikeshare.weekday.value_counts().min()
mean_count = bikeshare.weekday.value_counts().mean()
y_tick_values = np.arange(0, max_count+25000, 25000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]
weekday_values = bikeshare.weekday.value_counts().values
clrs = ['#aeebe5' if (x > mean_count) else 'cyan' for x in weekday_values ]

# Seaborn's count plot
sb.countplot(data = bikeshare, x = 'weekday', palette=clrs, 
             alpha= 0.5, order = day_order, saturation = 0.5)

# improve plot aesthetics
plt.title('Aggregated distribution of bike rentals over the weekday\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nDay of the week', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
day_counts = bikeshare.weekday.value_counts(ascending=False).values
day_max = day_counts.max()
# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    try:
        count = day_counts[loc]
        pct_string = '{:0.1f}%'.format(100*count/n_points)
    except KeyError:
        count = 15000
        pct_string = '0%'

    # print the annotation depending on the bar length
    if count < (day_max/10):
        plt.text(loc, count+(day_max/25), pct_string, ha = 'center', color = 'black', fontsize = 12)
    else:
        plt.text(loc, count-(day_max/15), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.1 Aggregated distribution of bike rentals over the week.png', dpi=300, bbox_inches='tight')
  • The aggregated distribution of bike rentals over the week shows that the weekend days, Saturday and Sunday, have relatively few bike rentals (fewer than the mean of the total rentals per weekday) compared with the other days of the week. This is a consequence of the customer base consisting mostly of working individuals who ride the bikes to work.

  • However, the number of occurrences of each weekday affects the aggregated rentals (not all weekdays occur the same number of times in a month). Hence, calculate the average bike rentals per weekday for a clearer analysis.
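
The unequal number of weekday occurrences can be checked with the standard library alone. The sketch below assumes that the data covers every calendar day from 2017-01-01 to 2019-12-31, which matches the first and last rows shown in the outputs above.

```python
from collections import Counter
from datetime import date, timedelta

# date.weekday() returns 0 for Monday through 6 for Sunday
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

start, end = date(2017, 1, 1), date(2019, 12, 31)
n_days = (end - start).days + 1

# Count how often each weekday occurs in the assumed date range
counts = Counter(
    WEEKDAYS[(start + timedelta(days=i)).weekday()]
    for i in range(n_days)
)
for name in WEEKDAYS:
    print(name, counts[name])
```

Over the full three years the counts differ by at most one, so the imbalance matters mostly within single months.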

3.9.2 Average bike rentals based on the weekday:

Create a dataset that contains the bike rentals for each day of the week, for each week of every month in each year. Care should be taken to include all days in every week of the month. Use all categorical combinations and fill NULL values with zeros so that every weekday of every week is taken into account.

In [177]:
# create a dataset for bike rentals over the days in a week
weekday_df = bikeshare.groupby([bikeshare['year'], 
                                bikeshare['month'],
                                bikeshare['week'],
                                bikeshare['weekday']]).count()['trip_id'].reset_index(name='rentals')

weekday_df['rentals'] = weekday_df['rentals'].fillna(0).astype(int)
weekday_df.head(10)
Out[177]:
year month week weekday rentals
0 2017 1 First Monday 259
1 2017 1 First Tuesday 327
2 2017 1 First Wednesday 350
3 2017 1 First Thursday 231
4 2017 1 First Friday 361
5 2017 1 First Saturday 277
6 2017 1 First Sunday 270
7 2017 1 Second Monday 306
8 2017 1 Second Tuesday 245
9 2017 1 Second Wednesday 315

Point plot:

In [178]:
# Assign palette and figure size as per requirement
plt.figure(figsize=[8,4])
sb.set_style('white')
flatui = ['cyan']
sb.set_palette(flatui, n_colors=1, desat=0.5)
base_color = sb.color_palette()[0]

# Seaborn's point plot
sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-", color = base_color)

# improve plot aesthetics
plt.title('Avg. bike rentals based on weekday\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nWeekday', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them to start from zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0,  max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
cat_order = weekday_df.weekday.sort_values(ascending=True).unique()
avg_rental_counts = weekday_df.groupby([weekday_df["weekday"]]).mean()['rentals'][cat_order]
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (count > ((avg_rental_max*9)/10)) else '#d479a3' for count in avg_rental_counts ]

# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for loc, label, avg_rental_count, clr in zip(locs, labels, avg_rental_counts, clrs):
    try:
        count = avg_rental_count
    except KeyError:
        count = 0   
    pct_string = '{:0.0f}'.format(count)
    # print the annotation depending on the bar length
    plt.text(loc, count + int(avg_rental_max/10), pct_string, ha = 'center', color = 'black', 
             fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------

# draw horizontal reference lines
plt.axhline(600, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(700, color='grey', alpha=1, linewidth=0.5, linestyle='--')

sb.despine(top=True, right=True, left=True, bottom=False);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.2 Average bike rentals based on day of the week.png', dpi=300, bbox_inches='tight')
  • The plot above shows that the average bike rentals by day of the week mostly range between 600 and 700. The yellow annotations mark the busy days of the week. There is a slight decrease in average bike rentals towards the weekend (Saturday, Sunday), while Friday appears to be the busiest day of the week.

  • However, the average is calculated over all rentals across the 3 years. Perform a per-year analysis for clearer insights.

3.9.3 Average rentals based on the weekday over individual years:

In [179]:
# Assign color palette and figure size as per requirement
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)

# Seaborn's point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
ax = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-", 
             hue = 'year', ci = None, order = plot_order)

# improve plot aesthetics
plt.title('Average bike rentals based on weekday over years\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)

# get ytick locs and rearrange them to start from zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0,  max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.8), loc = 6, labelspacing=0.5,  
           title='Year', title_fontsize=12, fontsize=10, facecolor='white', markerfirst=True, 
           handlelength=2, handletextpad=0.5)

# Add a inverted triangle marker at desired data point
ax.lines[0].set_alpha(0.8)
ax.lines[1].set_alpha(0.8)
ax.lines[2].set_markevery(every=[5,6])
ax.lines[2].set_marker('v')        
ax.lines[2].set_markersize(12)
ax.lines[2].set_markeredgewidth(3)
ax.lines[2].set_markerfacecolor('lightskyblue')
ax.lines[2].set_markeredgecolor('indianred')

# draw horizontal reference lines
plt.axhline(500, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(600, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(700, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(800, color='grey', alpha=1, linewidth=0.5, linestyle='--')

sb.despine(top=True, right=True, left=False, bottom=False);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.3 Average bike rentals based on day of the week over years.png', dpi=300, bbox_inches='tight')

Observation:

The plot above shows that in 2017 and 2018 the average bike rentals on weekends are only slightly lower than on the other days of the week, whereas 2019 shows a sudden drop in average bike rentals on weekends (Saturday and Sunday). This is not a good sign for a healthy business model and calls for reforms.

Reform:

Organizing or promoting fitness and recreational activities such as bike rallies could significantly increase bike rentals on weekends and holidays.

Let us take a look at the other factors that influence the bike rentals over weekday:

3.9.4 Average bike rentals based on the weekday and trip type:

The fifth week of a month does not contain every weekday, because the number of days in a month is generally not a multiple of 7 (the number of days in a week). Hence, to calculate the average rentals per weekday accurately, use the size() method, which counts the rows that actually occur for each combination of year, month, week, weekday and trip type; rows with a NULL grouping key are excluded by default.
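To make the behaviour of size() concrete, here is a minimal sketch on a made-up table. The `trips` frame and its values are assumptions for illustration only. It contrasts size(), which counts rows, with count(), which counts non-null values of a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per trip; the second trip has a missing trip_id
trips = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Tuesday"],
    "trip_id": [101, np.nan, 103],
})

# size() counts every row in each group, whether or not other columns are null
rentals_size = trips.groupby("weekday").size().reset_index(name="rentals")

# count() counts only the non-null values of the selected column
rentals_count = trips.groupby("weekday")["trip_id"].count().reset_index(name="rentals")

print(rentals_size)
print(rentals_count)
```

On this toy data the two results differ only for Monday, where one trip_id is missing.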

In [180]:
# create a dataset for bike rentals over each weekday in a week categorized by trip type
weekday_df = bikeshare.groupby([bikeshare["year"], 
                                bikeshare["month"],
                                bikeshare["week"],
                                bikeshare["weekday"],
                                bikeshare["trip_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Out[180]:
year month week weekday trip_type rentals
0 2017 1 First Monday One Way 228
1 2017 1 First Monday Round Trip 31
2 2017 1 First Tuesday One Way 288
3 2017 1 First Tuesday Round Trip 39
4 2017 1 First Wednesday One Way 325
5 2017 1 First Wednesday Round Trip 25
6 2017 1 First Thursday One Way 211
7 2017 1 First Thursday Round Trip 20
8 2017 1 First Friday One Way 325
9 2017 1 First Friday Round Trip 36

Point plot:

In [181]:
def assign_clrs(counts):
    clr_list = []
    for i in range(len(counts)):
        try:
            if counts[i] > counts[i-1]:
                clr_list.append('mediumseagreen')
            else:
                clr_list.append('salmon')
        except KeyError:
            clr_list.append('mediumseagreen')
    return clr_list


def assign_df(dataframe, column):
    index_list1 = []
    index_list2 = []
    df = dataframe.reset_index()
    for i in range(df.shape[0]):
        if df.iloc[i].rentals > df.iloc[i-1].rentals:
            index_list1.append(i)
        else:
            index_list2.append(i)
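    # copy the slices so that the dtype assignment below does not act on a view
    inc_df = df.loc[index_list1,:].copy()
    dec_df = df.loc[index_list2,:].copy()
    # use the function argument rather than the global loop variable
    level_order = dataframe[column].sort_values(ascending=True).unique()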
    ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
    inc_df[column] = inc_df[column].astype(ordered_cat)
    dec_df[column] = dec_df[column].astype(ordered_cat)
    return inc_df, dec_df


# Assign figure size and color palette
plt.figure(figsize=[8, 5])
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
base_color = sb.color_palette()[0]

# Seaborn's pointplot
ax1 = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = ['-', '-'], 
                   hue = 'trip_type', ci = None, markers=["", ""])


# add annotations
# -------------------------------------------------------
# select 'rentals' before averaging so that non-numeric columns (e.g. week) are not aggregated
avg_rentals = weekday_df.groupby([weekday_df["trip_type"], weekday_df["weekday"]])['rentals'].mean().reset_index()
avg_rentals_base = avg_rentals.query(' trip_type == "One Way" ')
avg_rentals_extended = avg_rentals.query(' trip_type == "Round Trip" ')

# get the current tick locations and labels
locs, labels = plt.xticks()

for categorical_df in [avg_rentals_base, avg_rentals_extended]:
    avg_rental_counts = categorical_df['rentals']
    avg_rental_max = avg_rental_counts.max()
    clrs = ['gold' if (count > ((avg_rental_max*9)/10)) else 'grey' for count in avg_rental_counts ]
    inc_df, dec_df = assign_df(categorical_df, 'weekday')
    sb.pointplot(data = inc_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None, 
                 color = 'green', markers = ["^"], ax = ax1);
    sb.pointplot(data = dec_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None, 
                 color = 'red', markers = ["v"], ax = ax1);
    
    # loop through each pair of locations and labels
    for loc, label, avg_rental_count, clr in zip(locs, labels, categorical_df.rentals, clrs):
        try:
            count = avg_rental_count
        except KeyError:
            count = 0   
        pct_string = '{:0.0f}'.format(count)
        indent = 40
        # place the annotation slightly above the data point
        plt.text(loc, count+indent, pct_string, ha = 'center', color = 'black', 
                 fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------


# add custom legend
# -------------------------------------------------------
custom_lines = [Line2D([0], [0], color=sb.color_palette()[0], lw=2),
                Line2D([0], [0], color=sb.color_palette()[1], lw=2)]

plt.legend(custom_lines, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, shadow=False, 
           ncol = 1, framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1));
# -------------------------------------------------------


# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average weekday bike rentals categorized by trip type\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)

# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# -------------------------------------------------------


for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.3)
    
sb.despine(top=True, right=True, left=False, bottom=False);

# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.4 Average weekday bike rentals categorized by trip type.png', dpi=300, bbox_inches='tight')

In the above plot, the yellow annotations mark the busy days of the week, while the markers show whether the respective day's bike rentals increased or decreased in comparison with the previous day.
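The increase/decrease markers can also be derived without an explicit loop. The following is a minimal sketch, assuming a hypothetical table `avg` of average rentals ordered Monday to Sunday (the values are made up). As in `assign_df`, where `iloc[i-1]` wraps around for the first row, Monday is compared against Sunday.

```python
import pandas as pd

# Hypothetical average rentals per weekday, ordered Monday..Sunday (not real data)
avg = pd.DataFrame({
    "weekday": ["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"],
    "rentals": [610.0, 650.0, 660.0, 640.0, 655.0, 560.0, 540.0],
})

# previous day's value; the first row (Monday) is filled with the last row (Sunday)
previous = avg["rentals"].shift(1, fill_value=avg["rentals"].iloc[-1])

# True where rentals rose compared with the previous day
increased = avg["rentals"] > previous

inc_df = avg[increased]    # days that would get the '^' marker
dec_df = avg[~increased]   # days that would get the 'v' marker

print(inc_df["weekday"].tolist())
print(dec_df["weekday"].tolist())
```

A boolean mask keeps the comparison rule in one place, which makes it easier to change, for example to leave the first day unmarked instead of wrapping around.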

Observation:

The above plot shows that although the number of customers taking One Way trips (probably working individuals who ride to work) decreases over the weekend, the number of customers taking Round Trips increases. This is clearly evident in the Round Trip line, where the yellow annotations mark the busy days of the week and the markers compare each day's rentals with the previous day.

This behaviour is reinforced by the next plot, which shows the average bike rentals categorized by fare type over the week.

3.9.5 Average bike rentals based on the weekday and fare type:

In [182]:
# create a dataset for bike rentals over each weekday in a week categorized by fare type
weekday_df = bikeshare.groupby([bikeshare["year"], 
                                bikeshare["month"],
                                bikeshare["week"],
                                bikeshare["weekday"],
                                bikeshare["fare_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Out[182]:
year month week weekday fare_type rentals
0 2017 1 First Monday Base 214
1 2017 1 First Monday Extended 45
2 2017 1 First Tuesday Base 296
3 2017 1 First Tuesday Extended 31
4 2017 1 First Wednesday Base 322
5 2017 1 First Wednesday Extended 28
6 2017 1 First Thursday Base 216
7 2017 1 First Thursday Extended 15
8 2017 1 First Friday Base 335
9 2017 1 First Friday Extended 26

Point plot:

In [183]:
def assign_clrs(counts):
    clr_list = []
    for i in range(len(counts)):
        try:
            if counts[i] > counts[i-1]:
                clr_list.append('mediumseagreen')
            else:
                clr_list.append('salmon')
        except KeyError:
            clr_list.append('mediumseagreen')
    return clr_list


def assign_df(dataframe, column):
    index_list1 = []
    index_list2 = []
    df = dataframe.reset_index()
    for i in range(df.shape[0]):
        if df.iloc[i].rentals > df.iloc[i-1].rentals:
            index_list1.append(i)
        else:
            index_list2.append(i)
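    # copy the slices so that the dtype assignment below does not act on a view
    inc_df = df.loc[index_list1,:].copy()
    dec_df = df.loc[index_list2,:].copy()
    # use the function argument rather than the global loop variable
    level_order = dataframe[column].sort_values(ascending=True).unique()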
    ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
    inc_df[column] = inc_df[column].astype(ordered_cat)
    dec_df[column] = dec_df[column].astype(ordered_cat)
    return inc_df, dec_df


# Assign figure size and color palette
plt.figure(figsize=[8, 5])
sb.set_style('white')
flatui = ['#577da1', '#eda668']
sb.set_palette(flatui, n_colors=2, desat=0.6)
base_color = sb.color_palette()[0]

# Seaborn's point plot
ax1 = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = ["-", "-"], hue = 'fare_type', 
                   scale = 1, ci = None, markers=["", ""])


# add annotations
# -------------------------------------------------------
# select 'rentals' before averaging so that non-numeric columns (e.g. week) are not aggregated
avg_rentals = weekday_df.groupby([weekday_df["fare_type"], weekday_df["weekday"]])['rentals'].mean().reset_index()
avg_rentals_base = avg_rentals.query(' fare_type == "Base" ')
avg_rentals_extended = avg_rentals.query(' fare_type == "Extended" ')

# get the current tick locations and labels
locs, labels = plt.xticks()

for categorical_df in [avg_rentals_base, avg_rentals_extended]:
    avg_rental_counts = categorical_df['rentals']
    avg_rental_max = avg_rental_counts.max()
    clrs = ['gold' if (count > ((avg_rental_max*4)/5)) else 'grey' for count in avg_rental_counts ]
    inc_df, dec_df = assign_df(categorical_df, 'weekday')
    sb.pointplot(data = inc_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None, 
                 color = 'green', markers = ["^"], ax = ax1);
    sb.pointplot(data = dec_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None, 
                 color = 'red', markers = ["v"], ax = ax1);
    
    # loop through each pair of locations and labels
    for loc, label, avg_rental_count, clr in zip(locs, labels, categorical_df.rentals, clrs):
        try:
            count = avg_rental_count
        except KeyError:
            count = 0   
        pct_string = '{:0.0f}'.format(count)
        indent = 40
        # place the annotation slightly above the data point
        plt.text(loc, count+indent, pct_string, ha = 'center', color = 'black', 
                 fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------

# add custom legend 
custom_lines = [Line2D([0], [0], color=sb.color_palette()[0], lw=2),
                Line2D([0], [0], color=sb.color_palette()[1], lw=2)]

plt.legend(custom_lines, ['Base', 'Extended'], scatterpoints=1, frameon=True, fancybox=True, shadow=False, 
           ncol = 1, framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,  
           title='Fare type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1));

# improve plot aesthetics
plt.title('Average weekday bike rentals categorized by fare type\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)

weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# plt.xlim(ax1.get_xlim())
plt.xticks(fontsize = 12);

# plot custom grid-lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.3)

sb.despine(top=True, right=True, left=False, bottom=False);

# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.5 Average weekday bike rentals categorized by fare type.png', dpi=300, bbox_inches='tight')

In the above plot, the yellow annotations mark the busy days of the week, while the markers show whether the respective day's bike rentals increased or decreased in comparison with the previous day.

Observation:

The above plot depicts that customers tend to travel for longer durations (bike rentals with extended fares) during the weekends.

3.9.6 Average bike rentals based on the weekday and pass type:

Let us observe the effect of the customer's pass type on the bike rentals over the weekend.

In [184]:
# create a dataset for bike rentals over each weekday in a week categorized by pass type
weekday_df = bikeshare.groupby([bikeshare["year"], 
                                bikeshare["month"],
                                bikeshare["week"],
                                bikeshare["weekday"],
                                bikeshare["pass_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Out[184]:
year month week weekday pass_type rentals
0 2017 1 First Monday Walk-up 115
1 2017 1 First Monday Monthly 121
2 2017 1 First Monday Annual 23
3 2017 1 First Tuesday Walk-up 73
4 2017 1 First Tuesday Monthly 234
5 2017 1 First Tuesday Annual 20
6 2017 1 First Wednesday Walk-up 81
7 2017 1 First Wednesday Monthly 244
8 2017 1 First Wednesday Annual 25
9 2017 1 First Thursday Walk-up 38

Point plot:

In [185]:
def assign_clr(pass_type): 
    if (pass_type == "Walk-up"): return sb.color_palette()[0] 
    elif (pass_type == "One Day"): return sb.color_palette()[1] 
    elif (pass_type == "Monthly"): return sb.color_palette()[2] 
    elif (pass_type == "Flex"): return sb.color_palette()[3]
    elif (pass_type == "Annual"): return sb.color_palette()[4]
    return 'gold'


# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ["#34e0c7", "#c271e3", "#4cb1f5", "#e06458", "#34495e"]
sb.set_palette(flatui, n_colors=5, desat=0.6)
base_color = sb.color_palette()[0]

# Seaborn's pointplot
ax = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-", 
                  hue = 'pass_type', scale = 1, ci = None)

# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average weekday bike rentals categorized by pass type\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, 
           framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5,  
           title='Pass type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1))

# plot custom grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.5)
    

# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.6 Average weekday bike rentals categorized by pass type.png', dpi=300, bbox_inches='tight')

Observation:

The above plot shows that customers with Monthly and Annual passes are less likely to ride a bike during the weekend. This may be because this customer base consists mainly of working individuals. However, the number of customers who take Walk-up and One Day passes increases during the weekend. This suggests that weekends attract customers other than working individuals, i.e. tourists/activists who enjoy taking a ride for sightseeing.
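To put a number on this weekend effect, the average rentals on weekdays and weekends can be compared per pass type. The following is a minimal sketch on made-up counts; the `toy_df` frame, the `day_kind` column, the `summary` table and all numbers are assumptions for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rental counts per day and pass type (not real data)
toy_df = pd.DataFrame({
    "weekday": ["Friday", "Saturday", "Friday", "Saturday"],
    "pass_type": ["Monthly", "Monthly", "Walk-up", "Walk-up"],
    "rentals": [300, 200, 100, 150],
})

# label each row as weekday or weekend
toy_df["day_kind"] = np.where(toy_df["weekday"].isin(["Saturday", "Sunday"]),
                              "weekend", "weekday")

# average rentals per pass type for weekdays and weekends
summary = toy_df.pivot_table(index="pass_type", columns="day_kind",
                             values="rentals", aggfunc="mean")

# relative change from weekday to weekend (negative means fewer weekend rentals)
summary["weekend_change"] = summary["weekend"] / summary["weekday"] - 1
print(summary)
```

A negative `weekend_change` would correspond to the pattern described for Monthly and Annual passes, a positive one to the pattern described for Walk-up and One Day passes.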

3.9.7 Average bike rentals based on the weekday and bike type:

Let us observe the effect of the customer's bike preference on the bike rentals over the weekend.

In [186]:
# create a dataset for bike rentals over each weekday in a week categorized by bike type
weekday_df = bikeshare.groupby([bikeshare["year"], 
                                bikeshare["month"],
                                bikeshare["week"],
                                bikeshare["weekday"],
                                bikeshare["bike_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Out[186]:
year month week weekday bike_type rentals
0 2017 1 First Monday unknown 259
1 2017 1 First Tuesday unknown 327
2 2017 1 First Wednesday unknown 350
3 2017 1 First Thursday unknown 231
4 2017 1 First Friday unknown 361
5 2017 1 First Saturday unknown 277
6 2017 1 First Sunday unknown 270
7 2017 1 Second Monday unknown 306
8 2017 1 Second Tuesday unknown 245
9 2017 1 Second Wednesday unknown 315

Point plot:

In [187]:
def assign_clr(bike): 
    if (bike == "unknown"): return sb.color_palette()[0] 
    elif (bike == "Standard"): return sb.color_palette()[1] 
    elif (bike == "Electric"): return sb.color_palette()[2] 
    elif (bike == "Smart"): return sb.color_palette()[3] 
    return 'gold'


# Assign palette as per requirement
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#ff7ddd', '#77f7cc', '#4b99eb', '#aa75fa']
sb.set_palette(flatui, n_colors=4, desat=0.6)
base_color = sb.color_palette()[0]

# Seaborn's point plot
sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-", hue = 'bike_type', 
             scale = 1, ci = None)

# improve plot aesthetics
plt.title('Average weekday bike rentals categorized by bike type\n',  weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)

# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)

# add annotations
# -------------------------------------------------------
# select 'rentals' before averaging so that non-numeric columns (e.g. week) are not aggregated
avg_rentals = weekday_df.groupby([weekday_df["bike_type"], weekday_df["weekday"]])['rentals'].mean().reset_index()
avg_rentals_max = avg_rentals.rentals.max()
avg_rentals_unknown = avg_rentals.query(' bike_type == "unknown" ')
avg_rentals_standard = avg_rentals.query(' bike_type == "Standard" ')
avg_rentals_electric = avg_rentals.query(' bike_type == "Electric" ')
avg_rentals_smart = avg_rentals.query(' bike_type == "Smart" ')

# get the current tick locations and labels
locs, labels = plt.xticks()

for categorical_df in [avg_rentals_unknown, avg_rentals_standard, avg_rentals_electric, avg_rentals_smart]:
    clrs = [assign_clr(bike) for bike in categorical_df.bike_type]
    # loop through each pair of locations and labels
    for loc, label, avg_rental_count, clr in zip(locs, labels, categorical_df.rentals, clrs):
        try:
            count = avg_rental_count
        except KeyError:
            count = 0   
        pct_string = '{:0.0f}'.format(count)
        indent = 40
        # place the annotation slightly above the data point
        plt.text(loc, count + indent, pct_string, ha = 'center', color = 'black', 
                 fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, 
           framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.05, 1))

sb.despine(top=True, right=True, left=False, bottom=False);

# plot custom grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.5);

# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.7 Average weekday bike rentals categorized by bike type.png', dpi=300, bbox_inches='tight')

Observation:

The above plot shows that rentals of Standard and Electric bikes are higher on weekdays. This is probably because the customer base consists mainly of working individuals. However, the weekends attract customers who prefer Smart bikes.

Next, let us explore the combined effect of pass_type and bike_type on the bike rentals based on weekday over the individual years.

3.9.8 Average bike rentals based on the weekday over years by bike type and pass type:

In [188]:
# create a dataset for bike rentals for each day in a week over the years by pass type and bike type
temp_df = bikeshare.query(' pass_type == "One Day" or pass_type == "Monthly" or pass_type == "Annual" ').copy()

level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
level_order = ['One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['pass_type'] = temp_df['pass_type'].astype(ordered_cat)

weekday_df = temp_df.groupby([temp_df["year"], 
                              temp_df["month"],
                              temp_df["week"],
                              temp_df["weekday"],
                              temp_df["bike_type"],
                              temp_df["pass_type"]]).count()['trip_id'].reset_index(name='rentals')
weekday_df.head(10)
Out[188]:
year month week weekday bike_type pass_type rentals
0 2017 1 First Monday Standard One Day NaN
1 2017 1 First Monday Standard Monthly NaN
2 2017 1 First Monday Standard Annual NaN
3 2017 1 First Monday Electric One Day NaN
4 2017 1 First Monday Electric Monthly NaN
5 2017 1 First Monday Electric Annual NaN
6 2017 1 First Monday Smart One Day NaN
7 2017 1 First Monday Smart Monthly NaN
8 2017 1 First Monday Smart Annual NaN
9 2017 1 First Tuesday Standard One Day NaN
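The NaN values in the rentals column above are probably a consequence of grouping on categorical columns: pandas can then report every combination of category levels, including combinations for which no trip exists (how such empty combinations are filled can depend on the pandas version). The following minimal sketch on hypothetical toy data shows how `observed=True` restricts the result to combinations that actually occur. The `trips` frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data; 'Electric' is a declared category level that never occurs
trips = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Tuesday"],
    "bike_type": pd.Categorical(["Standard", "Smart", "Standard"],
                                categories=["Standard", "Electric", "Smart"]),
    "trip_id": [1, 2, 3],
})

# observed=True keeps only the key combinations that occur in the data
rentals = (trips.groupby(["weekday", "bike_type"], observed=True)
                .size()
                .reset_index(name="rentals"))
print(rentals)
```

With this option the empty combinations are simply absent, so they neither show up as NaN nor pull an average towards zero.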

Facet Grid:

In [189]:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#cc3f9d', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)

# Facet grid with point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = weekday_df, col = 'pass_type', row = 'bike_type', margin_titles=True, height = 3, aspect = 1, hue = 'year')
g.map(sb.pointplot, "weekday", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average bike rentals based on day of the week over years by bike type and pass type', 
               fontsize = 16, weight = 'bold')
g.set_titles('Bike = {row_name} | Pass = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')

# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nDay of the week', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 10)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
     # get x labels
    labels = ax.get_xticklabels()
    # set new labels
    ax.set_xticklabels(labels, rotation = 30, size = 10)
# -------------------------------------------------------

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
          Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]

plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 3, title='Year', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.1, 4.3));
# -------------------------------------------------------

plt.subplots_adjust(wspace=0.05, hspace=0.3);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.8 Average bike rentals based on weekday over the years by bike type and pass type.png', dpi=300, bbox_inches='tight')
  • The average bike rentals for the Monthly pass type and Standard bike type decrease towards the weekend.
  • The average bike rentals for the Monthly pass type and Electric bike type decrease towards the weekend.
  • The average bike rentals for the One Day pass type and Standard bike type increase towards the weekend.
  • The average bike rentals for the One Day pass type and Electric bike type increase towards the weekend.
  • The average bike rentals for the Smart bike type increase slightly towards the weekend, irrespective of pass type.

Observations:

  • This indicates that pass_type has a stronger influence than bike_type on the bike rentals over the week.
  • This reinforces the argument that the Monthly pass is preferred by working individuals, who prefer Standard and Electric bikes; its average bike rentals decrease over weekends (non-working days).
  • The One Day pass attracts new/temporary customers such as tourists or explorers, who prefer Standard and Smart bikes; its average bike rentals increase over weekends and holidays.
  • The Smart bike experiences a slight increase in average bike rentals over weekends.

----------------------------------------------

3.9.9 Insights:

  1. In 2017 and 2018, average bike rentals decrease only slightly on weekends compared to the other days of the week, whereas 2019 shows a sudden drop in average bike rentals on weekends (Saturday and Sunday). This is not a good sign for a healthy business model and calls for reforms.
  2. Customers with long-term subscriptions such as the Annual pass and the Monthly pass prefer Standard and Electric bikes, travel mostly on working days, and are less likely to travel during weekends. As the customer base consists mainly of working individuals, they tend to prefer One Way trips, which decrease during weekends.
  3. Although the number of customers taking One Way trips (probably working individuals who ride to work) decreases over weekends, the number of customers taking Round Trips increases during the weekends.
  4. The pass_type has a stronger influence than bike_type on the bike rentals over the week.
  5. The Smart bike experiences a slight increase in average bike rentals over weekends.
  6. New/temporary customers with no existing pass (say tourists/travellers/activists) tend to take a short-term pass such as the One Day pass and prefer Standard and Smart bikes. Hence Smart bikes experience their highest bike rentals during the weekends. This category of customers also tends to take Round Trips and ride for longer durations, resulting in Extended fares and thus more income for the company.

3.9.10 Reforms proposed:

  1. Organizing or promoting fitness and recreational activities such as bike rallies could significantly increase bike rentals on weekends and holidays.
  2. The number of customers with a One Day pass who prefer Standard bikes dropped significantly during 2019. Hence attracting this category of customers to use Standard bikes would strengthen the business model.
  3. As the major part of the customer base consists of working individuals, take advantage of the low rentals during the weekend to rebalance the availability of bikes across the stations in preparation for the rental traffic on Monday.

3.10 Are there any bike stations that have low bike rental/return activity across the geographical distribution and are not economical to maintain?

  • Column: start_lat, start_lon, end_lat, end_lon
  • Data type: numerical, continuous
  • Plot : Heat map

3.10.1 Exploration of geographical distribution of bike rentals based on start station's co-ordinates:

In [190]:
# Assign figure size as per requirement
plt.figure(figsize = [8, 4])

h2d = plt.hist2d(data = bikeshare, x = 'start_lat', y = 'start_lon', cmin = 0.5, cmap = 'viridis_r')

# improve plot aesthetics
plt.title('Geographical distribution of Start stations\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nLatitude', fontsize = 14)
plt.ylabel('Longitude\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

# add annotations
# -------------------------------------------------------
# getting individual elements
counts = h2d[0]
x_bins = h2d[1]
y_bins = h2d[2]

counts_list = []
x_bin_diff_list = []
y_bin_diff_list = []

for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        # eliminate nan and append only if c does not exist in counts_list
        if c not in counts_list and not np.isnan(c):
            counts_list.append(c)

for bin in range(len(x_bins)-1):
    x_bin_diff = x_bins[bin+1] - x_bins[bin]
    if x_bin_diff not in x_bin_diff_list:
        x_bin_diff_list.append(x_bin_diff)
        
for bin in range(len(y_bins)-1):
    y_bin_diff = y_bins[bin+1] - y_bins[bin]
    if y_bin_diff not in y_bin_diff_list:
        y_bin_diff_list.append(y_bin_diff)

counts_mean = np.mean(counts_list)
x_bin_size = max(x_bin_diff_list)
y_bin_size = max(y_bin_diff_list)

for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= counts_mean: # increase visibility on darkest cells
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'white', fontsize = 9)
        elif c > 0:
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'black', fontsize = 9)
# -------------------------------------------------------

plt.colorbar();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.1 Geographical distribution of Start stations.png', dpi=300, bbox_inches='tight')

The above plot shows that some clusters of start stations account for fewer than 100 bike rentals over the 3-year period. These stations should be either relocated or shut down, as their maintenance likely costs significantly more than the income they generate.
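The annotation code above places each count at the centre of its cell. Since the bin edges are already available, the centres can be computed directly from neighbouring edges. The following is a minimal sketch using `np.histogram2d`, which returns counts and bin edges in the same layout without plotting; the coordinates are hypothetical and `bins=2` is chosen only to keep the example small.

```python
import numpy as np

# Hypothetical station coordinates (not real data)
lat = np.array([34.01, 34.02, 34.02, 34.09])
lon = np.array([-118.29, -118.28, -118.28, -118.21])

# counts[i, j] is the number of points in x-bin i and y-bin j
counts, x_edges, y_edges = np.histogram2d(lat, lon, bins=2)

# the centre of each bin is the midpoint of its two edges
x_centers = (x_edges[:-1] + x_edges[1:]) / 2
y_centers = (y_edges[:-1] + y_edges[1:]) / 2

for i, xc in enumerate(x_centers):
    for j, yc in enumerate(y_centers):
        if counts[i, j] > 0:
            print(f"cell centre ({xc:.3f}, {yc:.3f}): {int(counts[i, j])} rentals")
```

Computing midpoints from the edges avoids collecting bin widths in separate lists and also stays correct if the bins are not all the same width.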

3.10.2 Exploration of geographical distribution of bike rentals based on end station's co-ordinates:

In [191]:
# Assign figure size as per requirement
plt.figure(figsize = [8, 4])

h2d = plt.hist2d(data = bikeshare, x = 'end_lat', y = 'end_lon', cmin = 0.5, cmap = 'viridis_r')

# improve plot aesthetics
plt.title('Geographical distribution of End stations\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nLatitude', fontsize = 14)
plt.ylabel('Longitude\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

# add annotations
# -------------------------------------------------------
# getting individual elements
counts = h2d[0]
x_bins = h2d[1]
y_bins = h2d[2]

counts_list = []
x_bin_diff_list = []
y_bin_diff_list = []

for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        # eliminate nan and append only if c does not exist in counts_list
        if c not in counts_list and not np.isnan(c):
            counts_list.append(c)

for bin in range(len(x_bins)-1):
    x_bin_diff = x_bins[bin+1] - x_bins[bin]
    if x_bin_diff not in x_bin_diff_list:
        x_bin_diff_list.append(x_bin_diff)
        
for bin in range(len(y_bins)-1):
    y_bin_diff = y_bins[bin+1] - y_bins[bin]
    if y_bin_diff not in y_bin_diff_list:
        y_bin_diff_list.append(y_bin_diff)

counts_mean = np.mean(counts_list)
x_bin_size = max(x_bin_diff_list)
y_bin_size = max(y_bin_diff_list)

for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= counts_mean: # increase visibility on darkest cells
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'white', fontsize = 9)
        elif c > 0:
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'black', fontsize = 9)
# -------------------------------------------------------

plt.colorbar();    

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.2 Geographical distribution of End stations.png', dpi=300, bbox_inches='tight')

The above plot shows that some clusters of end stations account for fewer than 100 bike returns over the 3-year period. These stations should be either relocated or shut down, as the maintenance cost is significantly more than the income generated.
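
The annotation loops above place each label at a bin's left edge plus half of the largest bin width. A slightly more direct option, sketched here, is to take the midpoint of each pair of adjacent bin edges. The sketch uses np.histogram2d on made-up coordinates so that it runs without a figure; the lat and lon arrays are hypothetical and are not taken from the bikeshare data. plt.hist2d returns the same kind of edge arrays as h2d[1] and h2d[2].

```python
import numpy as np

# hypothetical coordinates, for illustration only
lat = np.array([34.00, 34.01, 34.02, 34.05, 34.05, 34.10])
lon = np.array([-118.50, -118.49, -118.45, -118.30, -118.30, -118.20])

# counts has shape (len(x_edges) - 1, len(y_edges) - 1)
counts, x_edges, y_edges = np.histogram2d(lat, lon, bins=4)

# midpoint of every bin along each axis
x_centers = (x_edges[:-1] + x_edges[1:]) / 2
y_centers = (y_edges[:-1] + y_edges[1:]) / 2

# collect (x, y, count) for every non-empty cell; these are the
# positions at which plt.text could draw the annotation
labels = [(x_centers[i], y_centers[j], int(counts[i, j]))
          for i in range(counts.shape[0])
          for j in range(counts.shape[1])
          if counts[i, j] > 0]
```

Because the midpoints are computed per bin, the labels stay centered even if the bins are not all the same width.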

3.10.3 Identification of stations which are financially liable for high maintenance:

A station can be recorded under more than one combination of latitude and longitude. This is because of the geographical extent of the station over its zone. Hence the bike traffic should be calculated per station_id rather than per combination of latitude and longitude.
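
One way to check this claim is to count the distinct coordinate pairs recorded for each station. The sketch below does so on a small made-up table; the column names start_lat and start_lon are assumed to mirror the end_lat and end_lon columns used above, and the values are hypothetical.

```python
import pandas as pd

# hypothetical trips, for illustration only
trips = pd.DataFrame({
    'start_station_id': [3005, 3005, 3005, 3006, 3006],
    'start_lat': [34.0485, 34.0485, 34.0486, 34.0456, 34.0456],
    'start_lon': [-118.2585, -118.2585, -118.2590, -118.2562, -118.2562],
})

# number of distinct (lat, lon) pairs recorded for each station
coords_per_station = (trips
                      .drop_duplicates(['start_station_id', 'start_lat', 'start_lon'])
                      .groupby('start_station_id')
                      .size())

# stations that appear under more than one coordinate pair
multi_coord = coords_per_station[coords_per_station > 1]
```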

The stations with the least activity should be identified from both their bike rental traffic and their bike return traffic, because a station with few rentals may compensate with high return traffic, and vice versa. Hence only the stations with low activity in both rentals and returns are deemed high maintenance and eligible for termination.

Create the dataframe with the bike rentals based on start_station_id.

In [192]:
# find the rentals based on start_station_id
start_stations = bikeshare.groupby([bikeshare['start_station_id']]).size().reset_index(name='rentals')
start_stations.rename(columns={'start_station_id':'station_id'}, inplace=True)
start_stations.head()
Out[192]:
station_id rentals
0 3005 35009
1 3006 15863
2 3007 14985
3 3008 11620
4 3009 55

Create the dataframe with the bike returns based on end_station_id.

In [193]:
# find the returns based on end_station_id
end_stations = bikeshare.groupby([bikeshare['end_station_id']]).size().reset_index(name='returns')
end_stations.rename(columns={'end_station_id':'station_id'}, inplace=True)
end_stations.head()
Out[193]:
station_id returns
0 3005 38639
1 3006 16430
2 3007 11910
3 3008 11912
4 3009 68

Combine the two dataframes into a single dataframe.

In [194]:
stations = pd.merge(start_stations, end_stations, on='station_id', how='outer')
stations = stations.fillna(0)
stations.head()
Out[194]:
station_id rentals returns
0 3005 35009.0 38639
1 3006 15863.0 16430
2 3007 14985.0 11910
3 3008 11620.0 11912
4 3009 55.0 68
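
In the output above, rentals becomes a float column because the outer merge introduces missing values for stations that never appear as a start station. A possible variation, sketched below on a few rows in the same shape, uses the indicator option of pd.merge to list those stations and then restores integer counts. The rows are a small illustrative sample rather than the full tables.

```python
import pandas as pd

# a few rows in the shape of start_stations and end_stations
start_stations = pd.DataFrame({'station_id': [3005, 3009],
                               'rentals': [35009, 55]})
end_stations = pd.DataFrame({'station_id': [3005, 3009, 4321, 4467],
                             'returns': [38639, 68, 1, 6]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
stations = pd.merge(start_stations, end_stations, on='station_id',
                    how='outer', indicator=True)

# stations that never appear as a start station
returns_only = stations.loc[stations['_merge'] == 'right_only', 'station_id'].tolist()

# fill only the count columns, then restore the integer dtype
stations[['rentals', 'returns']] = stations[['rentals', 'returns']].fillna(0).astype(int)
```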

Plot the distribution of bike rentals and bike returns for investigation.

In [195]:
# Assign color palette and figure size as per requirement
plt.figure(figsize = [6, 4])
sb.set_style('white')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[0]

# Seaborn's regplot
sb.regplot(data = stations, x = 'rentals', y = 'returns', fit_reg = True, 
           scatter_kws = {'alpha' : 1/10}, line_kws = {'alpha' : 0.2}, color = base_color);

# improve plot aesthetics
plt.title("Distribution of bike station's traffic\n", fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rentals (thousands)', fontsize = 14)
plt.ylabel('Bike Returns (thousands)\n', fontsize = 14)

# get xtick locs and rearrange them with respect to zero
x_locs, x_labels = plt.xticks()
x_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in x_locs]
plt.xticks(x_locs, x_tick_names, fontsize = 12)

# get ytick locs and rearrange them with respect to zero
y_locs, y_labels = plt.yticks()
y_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in y_locs]
plt.yticks(y_locs, y_tick_names, fontsize = 12);

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.3 Distribution of bike stations traffic.png', dpi=300, bbox_inches='tight')

The above plot shows that bike rentals and returns follow a roughly linear pattern. Most bike stations are clustered between 0 and 10K bike returns and between 0 and 5K bike rentals.
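
The regression line drawn by regplot is not accompanied by any numbers. A minimal sketch of how the linear pattern could be quantified is given below, using only the first five stations shown in the merged table; on the full stations dataframe the values would differ.

```python
import numpy as np
import pandas as pd

# the first five stations from the merged table above
stations = pd.DataFrame({'station_id': [3005, 3006, 3007, 3008, 3009],
                         'rentals': [35009, 15863, 14985, 11620, 55],
                         'returns': [38639, 16430, 11910, 11912, 68]})

# Pearson correlation between rentals and returns
r = stations['rentals'].corr(stations['returns'])

# least-squares line: returns is approximately slope * rentals + intercept
slope, intercept = np.polyfit(stations['rentals'], stations['returns'], deg=1)
```

A correlation close to 1 together with a slope close to 1 would support the reading that stations receive about as many bikes as they send out.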

3.10.4 Visual display of stations with relatively low activity:

Plot the stations with low activity:

In [196]:
# Assign figure and color palette as per requirement
plt.figure(figsize=[18, 5])
sb.set_style('white')
sb.set_palette('deep', n_colors = 2, desat = 0.8)

# prepare the data for the subplots
low_traffic = stations[(stations['rentals'] < 10) & (stations['returns'] < 10)]
not_low_traffic = stations[~((stations['rentals'] < 10) & (stations['returns'] < 10))]

# left plot: dataset that has all entries
# -------------------------------------------------------
plt.subplot(1, 3, 1)

sb.regplot(data = not_low_traffic, x = 'rentals', y = 'returns', color = 'c',
           fit_reg = False, scatter_kws = {'alpha' : 1/10});

sb.regplot(data = low_traffic, x = 'rentals', y = 'returns', color = 'orange',
           fit_reg = False, scatter_kws = {'alpha' : 1/2});

# improve plot aesthetics
plt.title('Overall traffic\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike Rentals (thousands)', fontsize = 14)
plt.ylabel('Bike Returns (thousands)\n', fontsize = 14)

# get xtick locs and rearrange them with respect to zero
x_locs, x_labels = plt.xticks()
x_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in x_locs]
plt.xticks(x_locs, x_tick_names, fontsize = 12)

# get ytick locs and rearrange them with respect to zero
y_locs, y_labels = plt.yticks()
y_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in y_locs]
plt.yticks(y_locs, y_tick_names, fontsize = 12);
# -------------------------------------------------------


# middle plot: zoom on stations with fewer than 100 rentals and returns
# -------------------------------------------------------
plt.subplot(1, 3, 2)

ax = sb.regplot(data = not_low_traffic, x = 'rentals', y = 'returns', color = 'c',
           fit_reg = False, scatter_kws = {'alpha' : 1/2});

sb.regplot(data = low_traffic, x = 'rentals', y = 'returns', color = 'orange',
           fit_reg = False, scatter_kws = {'alpha' : 1/2}, ax = ax);

ax.set(xlim=(-10, 100))
ax.set(ylim=(-10, 100));

# improve plot aesthetics
plt.title('Traffic under 100\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike Rentals', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12);
# -------------------------------------------------------


# right plot: zoom on stations with fewer than 10 rentals and returns
# -------------------------------------------------------
plt.subplot(1, 3, 3)

ax = sb.regplot(data = not_low_traffic, x = 'rentals', y = 'returns', color = 'c',
           fit_reg = False, scatter_kws = {'alpha' : 1/2});

sb.regplot(data = low_traffic, x = 'rentals', y = 'returns', color = 'orange',
           fit_reg = False, scatter_kws = {'alpha' : 1/2}, ax = ax);

ax.set(xlim=(-1, 10))
ax.set(ylim=(-1, 10));
# improve plot aesthetics
plt.title('Traffic under 10\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike Rentals', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12);
# -------------------------------------------------------


plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle("Distribution of bike station's traffic\n", fontsize = 18, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.4 Distribution of bike stations traffic.png', dpi=300, bbox_inches='tight')

In the above plot, the orange markers represent the bike stations with very low activity (fewer than 10 rentals and fewer than 10 returns).

Observation:

The above plot shows that some stations have relatively low bike activity and are deemed high maintenance. Each of these stations recorded fewer than 10 rentals and fewer than 10 returns. Hence these stations are not financially suitable for further maintenance and need to be terminated or relocated.

3.10.5 Identification of the stations with low bike activity for further action:

In [197]:
# extract the stations with low bike activity
low_activity = stations[(stations['rentals'] < 10) & (stations['returns'] < 10)]
low_activity
Out[197]:
station_id rentals returns
79 4143 2.0 2
155 4327 1.0 1
184 4362 1.0 1
185 4363 8.0 4
188 4373 2.0 3
270 4490 1.0 1
275 4321 0.0 1
276 4467 0.0 6
277 4468 0.0 3

Display the list of the stations with low bike activity:

In [198]:
# display the id of the stations with low bike activity
print('Low activity bike stations:')
print('-'*27)
for i, station in enumerate(low_activity.station_id.values):
    print('{}. Station ID: {}'.format(i+1, station))
Low activity bike stations:
---------------------------
1. Station ID: 4143
2. Station ID: 4327
3. Station ID: 4362
4. Station ID: 4363
5. Station ID: 4373
6. Station ID: 4490
7. Station ID: 4321
8. Station ID: 4467
9. Station ID: 4468

----------------------------------------------

3.10.6 Insight:

  1. The stations with IDs (4143, 4327, 4362, 4363, 4373, 4490, 4321, 4467, 4468) have very low bike activity and are deemed high maintenance. Each of these stations recorded fewer than 10 rentals and fewer than 10 returns over the span of 3 years.

3.10.7 Reform proposed:

  1. Hence these stations are not financially suitable for further business and need to be either terminated or relocated to locations with potential bike traffic.

3.11 Is there a gap between the demand and supply of the bikes at any given time in a day? If yes, propose a model for reducing the gap.

  • Column: start_station_id, end_station_id, start_time, end_time
  • Data type: (categorical, ordinal), (categorical, ordinal), (numerical, continuous), (numerical, continuous)
  • Plot : line plot, scatter plot

Let us observe the aggregated bike rentals and bike returns for each hour of the day, to identify the gap between bike demand and supply.

3.11.1 Distribution of average bike rentals and bike returns over the hour of the day:

In [199]:
# Assign grid and figure size
plt.figure(figsize = [8, 6])
sb.set_style('darkgrid')

# prepare the data for the plot
x1 = bikeshare.groupby(bikeshare['start_time'].dt.hour).count()['trip_id'].index
y1 = bikeshare.groupby(bikeshare['start_time'].dt.hour).count()['trip_id'].values
x2 = bikeshare.groupby(bikeshare['end_time'].dt.hour).count()['trip_id'].index
y2 = bikeshare.groupby(bikeshare['end_time'].dt.hour).count()['trip_id'].values
x_tick_values = np.arange(0,  23+1, 1)
x_tick_names = ['{:}'.format(v) for v in x_tick_values]
y_tick_values = np.arange(0,  bikeshare.start_time.dt.hour.value_counts().max()+10000, 10000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]

# plot matplotlib's line plot
plt.plot(x1, y1, linewidth=2.0, color = 'lightskyblue', alpha = 0.5)
plt.plot(x2, y2, linewidth=2.0, color = 'orange', alpha = 0.5)

# improve plot aesthetics
plt.title('Distribution of hourly rentals and returns\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nHour of the day', fontsize = 14)
plt.ylabel('Count (thousands)\n', fontsize = 14)
plt.xticks(x_tick_values, x_tick_names, fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
plt.fill_between(x1, y1, color = 'lightskyblue', alpha = 0.5)
plt.fill_between(x2, y2, color = 'orange', alpha = 0.5)

# draw the vertical axial line at the peak hour
start_peak_hour = bikeshare['start_time'].dt.hour.value_counts(ascending=False).index[0]
plt.axvline(start_peak_hour, color='black', alpha=0.3, linewidth=2)
end_peak_hour = bikeshare['end_time'].dt.hour.value_counts(ascending=False).index[0]
plt.axvline(end_peak_hour, color='pink', alpha=0.3, linewidth=2);

# add custom legend 
custom_lines = [Line2D([0], [0], color= 'lightskyblue', lw=2),
                Line2D([0], [0], color= 'orange', lw=2)]

plt.legend(custom_lines, ['Rentals', 'Returns'], scatterpoints=1, frameon=True, fancybox=True, shadow=False, 
           ncol = 1, framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,  
           title='Metro Bike', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1));

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.11.1 Distribution of hourly bike rentals and returns.png', dpi=300, bbox_inches='tight')

The above plot depicts that:

  • The availability of the bikes is higher than demand during the early hours (00:00 - 05:00).
  • The availability of the bikes is slightly lower than demand during the morning and post-morning hours (05:00 - 13:00).
  • The availability of the bikes is higher than demand during the evenings and nights (14:00 - 23:00).

However, both bike rentals and bike returns are plotted as an aggregate over 3 years (2017 - 2019). Hence plot the average bike rentals and bike returns for each individual year to look for hidden insights.
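
The gap between supply and demand can also be expressed as a single number per hour: returns minus rentals. The sketch below computes it on a handful of made-up trips; the trips dataframe is hypothetical and only mirrors the start_time and end_time columns of bikeshare.

```python
import pandas as pd

# hypothetical trips, for illustration only
trips = pd.DataFrame({
    'start_time': pd.to_datetime(['2019-01-01 08:05', '2019-01-01 08:40',
                                  '2019-01-01 09:10', '2019-01-01 17:30']),
    'end_time': pd.to_datetime(['2019-01-01 08:25', '2019-01-01 09:05',
                                '2019-01-01 09:45', '2019-01-01 18:10']),
})

hours = range(24)

# trips started and ended in each hour of the day (0 where none)
rentals = trips.groupby(trips['start_time'].dt.hour).size().reindex(hours, fill_value=0)
returns = trips.groupby(trips['end_time'].dt.hour).size().reindex(hours, fill_value=0)

# positive: more bikes returned than rented in that hour; negative: the opposite
net_flow = returns - rentals
```

Hours with a negative net_flow are the ones in which more bikes leave the stations than come back.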

3.11.2 Distribution of average bike rentals and bike returns over the years:

In [200]:
# create a dataset for the bike rentals for each hour in a respective day
start_df = bikeshare.groupby([bikeshare['start_time'].dt.year, 
                              bikeshare['start_time'].dt.month,
                              bikeshare['start_time'].dt.day,
                              bikeshare['start_time'].dt.hour]).size()
start_df = start_df.rename_axis(['year','month', 'day', 'hour']).reset_index(name='rentals')
start_df['rentals'] = start_df['rentals'].fillna(0).astype(int)
start_df.head()
Out[200]:
year month day hour rentals
0 2017 1 1 0 9
1 2017 1 1 1 5
2 2017 1 1 2 8
3 2017 1 1 3 2
4 2017 1 1 4 1
In [201]:
# create a dataset for the bike returns for each hour in a respective day
end_df = bikeshare.groupby([bikeshare['end_time'].dt.year, 
                            bikeshare['end_time'].dt.month,
                            bikeshare['end_time'].dt.day,
                            bikeshare['end_time'].dt.hour]).size()
end_df = end_df.rename_axis(['year','month', 'day', 'hour']).reset_index(name='returns')
end_df['returns'] = end_df['returns'].fillna(0).astype(int)
end_df.head()
Out[201]:
year month day hour returns
0 2017 1 1 0 7
1 2017 1 1 1 2
2 2017 1 1 2 13
3 2017 1 1 3 2
4 2017 1 1 5 3

Point plot:

In [202]:
def point_subplot(subplot, year):
    # plot the average hourly bike rentals and returns for the given year
    #-----------------------Start of subplot-----------------------
    
    # prepare the data for the plot
    sb.set_style('dark')
    plt.subplot(1, 4, subplot)
    start_year_df = start_df[ start_df['year'] == year ]
    end_year_df = end_df[ end_df['year'] == year ]

    #plot point plots for bike rentals and bike returns over the year
    ax = sb.pointplot(data = start_year_df, x = "hour", y = "rentals", linestyles = "-", 
                      color = 'lightskyblue', ci=None, markers = '')
    ax = sb.pointplot(data = end_year_df, x = "hour", y = "returns", linestyles = "-", 
                      color = 'orange', ci=None, ax =ax, markers = '')

    # obtain the two lines from the axes to generate shading
    l1 = ax.lines[0]
    l2 = ax.lines[1]

    # Get the xy data from the lines so that we can shade
    x1 = l1.get_xydata()[:,0]
    y1 = l1.get_xydata()[:,1]
    x2 = l2.get_xydata()[:,0]
    y2 = l2.get_xydata()[:,1]
    
    # fill the area under the individual lines
    ax.fill_between(x1,y1, color='lightskyblue', alpha=0.5)
    ax.fill_between(x2,y2, color='orange', alpha=0.5);
    
    # improve plot aesthetics
    plt.title('Year = {}\n'.format(year), fontsize = 14, weight = 'bold', color = 'dimgrey')
    plt.xlabel('\nHour of the day', fontsize = 14)
    plt.xticks(fontsize = 10)
    locs, labels = plt.yticks()
    if subplot == 1:
        # the values are average hourly counts, not thousands
        plt.ylabel('Average count\n', fontsize = 14)
        plt.yticks(fontsize = 10)
    else:
        plt.ylabel('')
        plt.yticks(locs, [])
    return ax
    #-------------------------End of subplot------------------------


# Assign grid and figure size
plt.figure(figsize = [24, 6])
sb.set_style('dark')

# plot subplots over years
ax1 = point_subplot(subplot = 1, year = 2017)
ax2 = point_subplot(subplot = 2, year = 2018)
ax3 = point_subplot(subplot = 3, year = 2019)

# adjust the plots to share the y axis limits of the tallest subplot
common_ylim = max((ax1.get_ylim(), ax2.get_ylim(), ax3.get_ylim()), key = lambda lim: lim[1])
for ax in (ax1, ax2, ax3):
    ax.set_ylim(common_ylim)


plt.subplots_adjust(wspace=0.05, hspace=0.3);
plt.subplots_adjust(top=0.8)
plt.suptitle('Distribution of average bike rentals and bike returns over the years\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.11.2 Average bike rentals and bike returns over years.png', dpi=300, bbox_inches='tight')

The above plot reinforces the previous observations on the distribution of bike demand and supply over the hours of the day. Over the 3-year span, the only period in which bike supply falls short of demand is the morning (8:00 AM - 1:00 PM). However, the gap is very small and does not require immediate attention.
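
Out[201] jumps from hour 3 to hour 5 on 2017-01-01, which suggests that hours without any trip are absent from start_df and end_df rather than recorded as zero. If so, a per-hour mean over those rows averages only the hours that had at least one trip. The sketch below shows one way to count empty hours as zero, on made-up trips; it assumes that every calendar day in the data has at least one trip, so that the number of distinct dates equals the number of days covered.

```python
import pandas as pd

# hypothetical trips over two days, for illustration only
trips = pd.DataFrame({'start_time': pd.to_datetime([
    '2017-01-01 00:10', '2017-01-01 00:50', '2017-01-01 03:15',
    '2017-01-02 00:20', '2017-01-02 03:05', '2017-01-02 03:40',
    '2017-01-02 03:55'])})

# number of distinct calendar days in the data
n_days = trips['start_time'].dt.normalize().nunique()

# total rentals per hour of day, with 0 for hours that never occur
totals = (trips.groupby(trips['start_time'].dt.hour)
               .size()
               .reindex(range(24), fill_value=0))

# average rentals per hour of day, counting empty hours as zero
avg_rentals = totals / n_days
```

Dividing by the number of days, instead of by the number of rows per hour, keeps quiet hours from looking busier than they are.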

----------------------------------------------

3.11.3 Insight:

  1. A 6-hour window (8:00 - 14:00) experiences a shortage of bike supply relative to customer demand. However, the gap between supply and demand is very small and does not require immediate attention.

3.12 Is bike rental traffic equally distributed over the start stations? If not, how can the start stations be optimized to increase their rental activity?

  • Column: start_station_id
  • Data type: categorical, ordinal
  • Plot : Distribution plot, pie chart, bar chart

3.12.1 Logarithmic distribution of start_stations bike rentals:

Calculate the number of bike rentals for each start station.

In [203]:
# find the rentals based on start_station_id
start_stations = bikeshare.groupby([bikeshare['start_station_id']]).size().reset_index(name='rentals')
start_stations.head()
Out[203]:
start_station_id rentals
0 3005 35009
1 3006 15863
2 3007 14985
3 3008 11620
4 3009 55

Explore the Logarithmic distribution of start stations bike rentals:

In [204]:
def log_trans(x, inverse = False):
    if not inverse:
        return np.log10(x)
    else:
        return 10 ** x


sb.set_style('white')

# prepare the data for the plot
min_value = log_trans(start_stations['rentals'].min())
max_value = log_trans(start_stations['rentals'].max())
bin_edges = np.arange(0, max_value+0.1, 0.1)

# matplotlib's histogram
plt.hist(start_stations['rentals'].apply(log_trans), bins = bin_edges, color = 'darkturquoise')

# improve plot aesthetics
plt.title('Logarithmic distribution of start stations bike rentals\n', fontsize = 14, weight = 'bold')
plt.xlabel('\nNumber of bike rentals', fontsize = 12)
plt.ylabel('Number of Start stations\n', fontsize = 12);
tick_locs = np.arange(0, max_value+1, 1)
plt.xticks(tick_locs, log_trans(tick_locs, inverse = True).astype(int), fontsize = 10)
plt.yticks(fontsize = 10)

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.1 Logarithmic distribution of start stations bike rentals.png', dpi=300, bbox_inches='tight')

Break down the bike rental traffic at the start stations based on the above plot.
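
The histogram above bins the log10-transformed counts in steps of 0.1. For the breakdown that follows, one bin per power of ten is enough. A minimal sketch with np.histogram is shown below; the rentals array is hypothetical.

```python
import numpy as np

# hypothetical rentals per station, for illustration only
rentals = np.array([1, 8, 55, 120, 900, 11620, 14985, 35009])

# bin edges at 1, 10, 100, ... up to the first power of ten at or above the maximum
max_exp = int(np.ceil(np.log10(rentals.max())))
bin_edges = 10.0 ** np.arange(0, max_exp + 1)

# number of stations in each decade of rental counts
counts, _ = np.histogram(rentals, bins=bin_edges)
```

The same edges could be passed to plt.hist together with plt.xscale('log') to draw the decades on the original scale.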

3.12.2 Classification of start_stations based on their rental traffic:

Create a dataframe based on bike rentals traffic and number of start stations associated with them.

In [205]:
rentals = {'rental_traffic' : pd.Series(['Very Low', 'Low', 'Normal', 'High', 'Very High']), 
           'start_stations' : pd.Series([start_stations.query(' rentals < 10 ').shape[0], 
                                         start_stations.query(' rentals >= 10 and rentals < 100 ').shape[0],
                                         start_stations.query(' rentals >= 100 and rentals < 1000 ').shape[0], 
                                         start_stations.query(' rentals >= 1000 and rentals < 10000 ').shape[0], 
                                         start_stations.query(' rentals >= 10000 ').shape[0]])} 
  
# create the Dataframe. 
bike_rentals = pd.DataFrame(rentals)
bike_rentals
Out[205]:
rental_traffic start_stations
0 Very Low 7
1 Low 44
2 Normal 102
3 High 92
4 Very High 28
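
The five queries above can also be expressed as a single binning step. The sketch below uses pd.cut with right=False, so that the bins are closed on the left like the query thresholds (for example, a station with exactly 10 rentals counts as Low). The small start_stations table here is hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical rentals per start station, for illustration only
start_stations = pd.DataFrame({'start_station_id': [1, 2, 3, 4, 5, 6],
                               'rentals': [9, 10, 99, 100, 5000, 35009]})

# left-closed bins [0, 10), [10, 100), ... match the query thresholds above
edges = [0, 10, 100, 1000, 10000, np.inf]
labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
start_stations['rental_traffic'] = pd.cut(start_stations['rentals'], bins=edges,
                                          labels=labels, right=False)

# stations per class, in label order
bike_rentals = (start_stations['rental_traffic']
                .value_counts()
                .reindex(labels)
                .rename_axis('rental_traffic')
                .reset_index(name='start_stations'))
```

Note that pd.cut closes bins on the right by default, so without right=False a station sitting exactly on a threshold would fall into the lower class.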

Plot the distribution of start stations bike rentals traffic.

In [206]:
def absolute_value(val):
    '''returns absolute count of start stations to plot in 
    the pie chart as annotations using the autopct argument'''
    a  = np.round(val/100.*type_level_counts.sum(), 0)
    return int(a)


# Assign grid and figure size
plt.figure(figsize = [12, 5])
sb.set_style('white')

# left plot: Pie chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 1)

# prepare the data for the plot
type_level_counts = bike_rentals.start_stations.values
type_level_index = bike_rentals.rental_traffic.values
explode = (0.2, 0, 0, 0, 0)
colors = ['paleturquoise', 'darkturquoise', 'darkturquoise', 'darkturquoise', 'darkturquoise']

# matplotlib's pie chart
plt.pie(type_level_counts, labels = type_level_index, startangle = 90,
        counterclock = False, wedgeprops = {'width' : 0.4}, shadow=False, 
        explode=explode, colors=colors, textprops={'fontsize': 12}, 
        autopct='%1.0f%%', labeldistance=1.1, pctdistance=0.8)
plt.title('Percent of Stations\n\n', fontsize = 14, weight = 'bold', color = 'grey')
plt.axis('square');
# =====================================================
# /////////////////////////////////////////////////////



# right plot: Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 2)

# prepare the data for the plot
counts = bike_rentals.start_stations.values
order = bike_rentals.start_stations.index
y_locs = [0, 1, 2, 3, 4]
y_labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
clrs = [ 'darkturquoise' if (x > bike_rentals.start_stations.values.min()) else 'paleturquoise' for x in counts ]

# seaborn's bar plot
sb.barplot(x = counts, y = order, palette=clrs, alpha= 1, saturation = 0.8, orient = 'h')

# improve plot aesthetics
plt.title('Number of Stations\n\n', weight = 'bold', fontsize = 14, color = 'grey')
plt.yticks(y_locs, y_labels, rotation = 0, fontsize = 12)
plt.xticks([], [], rotation = 0, fontsize = 12) 
plt.xlabel('', fontsize = 14)
plt.ylabel('', fontsize = 14)

# add annotations
# -------------------------------------------------------
# loop through each pair of locations and labels
for loc, count in zip(y_locs, counts):
    pct_string = '{:0.0f}'.format(count)
    
    # print the annotation based on bar length
    if count < int(max(counts)/10):
        plt.text(count+int(max(counts)/25), loc+0.1, pct_string, ha = 'center', color = 'black', weight = 'bold', fontsize = 13)
    else:
        plt.text(count-int(max(counts)/15), loc+0.1, pct_string, ha = 'center', color = 'white', fontsize = 13)
# -------------------------------------------------------

sb.despine(fig=None, ax=None, top=True, right=True, left=True, bottom=True, offset=None, trim=False);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Classification of start stations based on Rental Traffic\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.2 Classification of start stations based on Rental Traffic.png', dpi=300, bbox_inches='tight')
  • The above plot shows that some start stations have very low bike rental activity. This indicates that bike rental traffic is not equally distributed over the start stations.

  • However, this does not imply that these start stations should be eliminated, as they might receive good bike return traffic and still deliver acceptable business metrics.
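
The pie chart above prints percentages, while the absolute_value helper defined in the same cell is not passed to plt.pie. If absolute counts were wanted on the wedges, one option is a small closure that captures the counts instead of reading a global variable. This is a sketch only; make_autopct and to_count are illustrative names.

```python
import numpy as np

def make_autopct(counts):
    '''return a function that converts a wedge percentage back
    into the absolute count it was computed from'''
    total = np.sum(counts)

    def autopct(pct):
        # pct is the wedge's share in percent, as passed by plt.pie
        return '{:d}'.format(int(np.round(pct / 100.0 * total)))

    return autopct

# station counts per traffic class, from the table above
type_level_counts = np.array([7, 44, 102, 92, 28])
to_count = make_autopct(type_level_counts)
```

Passing autopct=make_autopct(type_level_counts) to plt.pie would then label each wedge with its station count.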

Deeper exploration of start stations rental behaviour for hidden insights:

3.12.3 Distribution of start stations rental traffic categorized by trip type:

Obtain the rentals for each start station, categorized by trip type:

In [207]:
# create a dataframe with start stations rentals over trip type
start_stations = bikeshare.groupby([bikeshare['start_station_id'], 
                                    bikeshare['trip_type']]).size().reset_index(name='rentals')
start_stations.head()
Out[207]:
start_station_id trip_type rentals
0 3005 One Way 32071
1 3005 Round Trip 2938
2 3006 One Way 14171
3 3006 Round Trip 1692
4 3007 One Way 14092

Group the rental traffic values into categorical bins:

In [208]:
# use pd.cut to assign each rental count to its bin
bin_edges = [0,10,100,1000,10000,100000]
category = pd.cut(start_stations['rentals'],bin_edges)
category = category.to_frame()
category.columns = ['rental_bins']
category['trip_type'] = start_stations['trip_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'trip_type', 'rental_bins'])
category.head()
Out[208]:
start_station_id trip_type rental_bins
0 3005 One Way (10000, 100000]
1 3005 Round Trip (1000, 10000]
2 3006 One Way (10000, 100000]
3 3006 Round Trip (1000, 10000]
4 3007 One Way (10000, 100000]
In [209]:
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Out[209]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the rental bins:

In [210]:
%%time

def label_traffic(df):
    '''map each rental bin (in ascending order) to its traffic label'''
    bins = df['rental_bins'].sort_values(ascending=True).unique()
    labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
    for rental_bin, label in zip(bins, labels):
        df.loc[df['rental_bins'] == rental_bin, 'traffic'] = label
    
df = category
label_traffic(df)

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 49 ms
Out[210]:
Normal       211
Low          153
High         122
Very Low      30
Very High     24
Name: traffic, dtype: int64

Prepare a dataframe to categorize start stations over rental traffic and trip_type:

In [211]:
# prepare a dataframe to categorize start stations over rental traffic and trip_type
temp_df = category.groupby([category['traffic'], category['trip_type']]).size().reset_index(name='start_stations')
temp_df
Out[211]:
traffic trip_type start_stations
0 Very Low One Way 8
1 Very Low Round Trip 22
2 Low One Way 57
3 Low Round Trip 96
4 Normal One Way 100
5 Normal Round Trip 111
6 High One Way 81
7 High Round Trip 41
8 Very High One Way 24

Data Dashboard:

Plot the distribution of start station traffic based on trip type:

In [212]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
            

# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
sb.set_palette('GnBu', n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'trip_type', hue = 'traffic', alpha = 0.8, saturation = 0.8)

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.85, 0.8), loc = 6, labelspacing=0.5,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 12), textcoords = 'offset points', fontsize = 12)

# plot vertical axial lines for categorical separation
separators = [0.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
flatui = ['darkturquoise', 'paleturquoise']
sb.set_palette(flatui, desat = 0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'trip_type', alpha = 0.8, saturation = 0.8)

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 12), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////


# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'trip_type', y = 'start_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)

# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('trip_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))

sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

flatui = ['darkturquoise', 'paleturquoise']
sb.set_palette(flatui, desat = 0.6)

# plot point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'trip_type');

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'trip_type', 'start_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.7)

# save the figure; bbox_inches='tight' adjusts the bounding box to include all of the x and y labels
plt.savefig('explanatory plots/3.12.3 Start stations rental traffic categorized by trip type.png', dpi=300, bbox_inches='tight')

Observations:

  • The plot above shows that, for One Way trips, 8 start stations have Very Low rental traffic and 57 have Low rental traffic, whereas for Round Trips 22 start stations have Very Low rental traffic and 96 have Low rental traffic. More stations therefore experience Low or Very Low rental traffic for Round Trips than for One Way trips.
  • The number of start stations with High or Very High rental traffic is also lower for Round Trips than for One Way trips.
  • This points to a need to improve Round Trip rental traffic at start stations.
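
The comparison in the first observation can be checked with a few lines of arithmetic. This is only a sketch that uses the four station counts quoted above; the dictionary name `low_traffic` is introduced here for illustration and does not appear elsewhere in the notebook.

```python
# Station counts quoted in the observations (Very Low and Low traffic only)
low_traffic = {
    'One Way':    {'Very Low': 8,  'Low': 57},
    'Round Trip': {'Very Low': 22, 'Low': 96},
}

# Total number of start stations with Low or Very Low traffic per trip type
totals = {trip: sum(counts.values()) for trip, counts in low_traffic.items()}
print(totals)

# How many more low-traffic stations Round Trips have than One Way trips
difference = totals['Round Trip'] - totals['One Way']
print(difference)
```

Note that a station is counted once per trip type, so the same station can appear in both totals.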

Reform:

  • Discounts or promotions should be offered on Round Trips at start stations that experience Low or Very Low rental traffic, to encourage customers to rent bikes from these stations.

3.12.4 Distribution of start stations rental traffic categorized by bike type:

Obtain the number of rentals at each start station, categorized by bike type:

In [213]:
# create a dataframe with start stations rentals over bike type
start_stations = bikeshare.groupby([bikeshare['start_station_id'], 
                                    bikeshare['bike_type']]).size().reset_index(name='rentals')
start_stations.head()
Out[213]:
start_station_id bike_type rentals
0 3005 unknown 17536
1 3005 Standard 11158
2 3005 Electric 6315
3 3006 unknown 7734
4 3006 Standard 4645

Bin the rental counts into traffic categories:

In [214]:
# use pd.cut to assign each rental count to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(start_stations['rentals'],bins)
category = category.to_frame()
category.columns = ['rental_bins']
category['bike_type'] = start_stations['bike_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'bike_type', 'rental_bins'])
category.head()
Out[214]:
start_station_id bike_type rental_bins
0 3005 unknown (10000, 100000]
1 3005 Standard (10000, 100000]
2 3005 Electric (1000, 10000]
3 3006 unknown (1000, 10000]
4 3006 Standard (1000, 10000]
In [215]:
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Out[215]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the rental bins:

In [216]:
%%time

def label_race(df):
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
    
df = category
label_race(df)

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 43 ms
Out[216]:
Normal       213
High         183
Low           85
Very Low      16
Very High     13
Name: traffic, dtype: int64
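
As an alternative sketch (not the approach used in this notebook), `pd.cut` accepts a `labels` argument, so binning and labelling can be done in one step and the result is already an ordered categorical. The rental counts below are made up for illustration; only the bin edges and the label names come from the cells above.

```python
import pandas as pd

# Hypothetical rental counts, one per (station, bike type) pair
rentals = pd.Series([7, 10, 85, 640, 4645, 17536])

bins = [0, 10, 100, 1000, 10000, 100000]
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# Intervals are right-closed by default, so 10 falls in (0, 10] -> 'Very Low'
traffic = pd.cut(rentals, bins=bins, labels=level_order)

print(traffic.tolist())
print(traffic.cat.ordered)
```

With this form there is no need to look up the interval objects or to cast to a `CategoricalDtype` afterwards, at the cost of no longer keeping the raw intervals in a separate column.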

Prepare a dataframe to categorize start stations over rental traffic and bike_type:

In [217]:
# prepare a dataframe to categorize start stations over rental traffic and bike_type
temp_df = category.groupby([category['traffic'], category['bike_type']]).size().reset_index(name='start_stations')
temp_df
Out[217]:
traffic bike_type start_stations
0 Very Low unknown 1
1 Very Low Standard 3
2 Very Low Electric 7
3 Very Low Smart 5
4 Low unknown 4
5 Low Standard 29
6 Low Electric 26
7 Low Smart 26
8 Normal unknown 34
9 Normal Standard 67
10 Normal Electric 68
11 Normal Smart 44
12 High unknown 85
13 High Standard 55
14 High Electric 37
15 High Smart 6
16 Very High unknown 10
17 Very High Standard 3

Data Dashboard:

Plot the distribution of Start station traffic based on bike type:

In [218]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
            

# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
sb.set_palette('GnBu', n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = df, x = 'bike_type', hue = 'traffic', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.75, 1.1), loc = 'upper left', labelspacing=0.5,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'bike_type', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.8), loc = 6, labelspacing=0.5,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
# sb.set_palette(flatui, n_colors=5, desat=0.8)
sb.set_palette('GnBu', n_colors=5, desat=0.6)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'bike_type', y = 'start_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('bike_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.8, 1.15))

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)

# plot point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'bike_type');

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'bike_type', 'start_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.7)

# save the figure; bbox_inches='tight' adjusts the bounding box to include all of the x and y labels
plt.savefig('explanatory plots/3.12.4 Start stations rental traffic categorized by bike type.png', dpi=300, bbox_inches='tight')
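
The marker-styling loop is repeated for every point plot in these dashboards. As a sketch of how it could be factored out, the hypothetical helper `style_point_lines` below (not part of the notebook) applies the same hollow-marker style to any axes. It takes the colours as an explicit list instead of reading `sb.color_palette()`, and it pairs lines with colours using `zip`, so lines beyond the end of the colour list are left unstyled rather than handled with `try`/`except`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

def style_point_lines(ax, colors, size=8, edge_width=2):
    '''Give the lines on `ax` hollow circular markers edged in the paired colour.'''
    for line, color in zip(ax.lines, colors):
        line.set_marker('o')
        line.set_markersize(size)
        line.set_markeredgewidth(edge_width)
        line.set_markerfacecolor('#ffffff')
        line.set_markeredgecolor(color)

# Minimal demonstration on two plain matplotlib lines
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [3, 5, 4])
ax.plot([0, 1, 2], [1, 2, 6])
style_point_lines(ax, ['#d62d6e', '#2da6d6'])
```

In the notebook the call would look like `style_point_lines(ax2, sb.color_palette())`, assuming the seaborn version in use still exposes the plotted lines through `ax2.lines`.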

Observations:

  • The plot above shows that, across bike types, very few start stations experience Very Low rental activity, which is a good sign for a healthy business model. However, the number of start stations that experience Very High rental activity is also very small, so most start stations are not being used to their full potential.
  • The numbers of start stations with Low and Very Low rental activity are closely clustered across bike types. This suggests that bike type has little influence on rental activity at these stations and that other factors are responsible for their low activity. Hence no bike-type-specific action is needed to increase rental activity at start stations with Low or Very Low rental activity.
  • Smart bikes have fewer start stations with Normal and High rental traffic than the other bike types.
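
Because each bike type is recorded at a different number of start stations, raw counts can be complemented with shares. The sketch below rebuilds the counts from the `temp_df` output shown earlier in this section and normalizes them per bike type. The two Very High rows that are absent from that output (Electric and Smart) are assumed to be zero, which is consistent with the Very High total of 13 in the value counts; the names `counts`, `totals` and `shares` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the temp_df output; the missing Very High rows
# for Electric and Smart are assumed to be zero
counts = pd.DataFrame(
    {'unknown':  [1, 4, 34, 85, 10],
     'Standard': [3, 29, 67, 55, 3],
     'Electric': [7, 26, 68, 37, 0],
     'Smart':    [5, 26, 44, 6, 0]},
    index=['Very Low', 'Low', 'Normal', 'High', 'Very High'])

# Number of start stations that recorded each bike type at all
totals = counts.sum()

# Percentage of each bike type's stations that fall in each traffic level
shares = (counts / totals * 100).round(1)

print(totals)
print(shares)
```

Reading the columns of `shares` side by side shows whether a bike type has fewer Normal and High stations only because it is recorded at fewer stations overall, or also in relative terms.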

Reform:

  • Discounts or promotions should be offered on Round Trips at start stations that experience Low or Very Low rental traffic, to encourage customers to rent bikes from these stations.
  • Announce promotions on Smart bikes to increase their rental activity at start stations.

3.12.5 Distribution of start stations rental traffic categorized by pass type:

Obtain the number of rentals at each start station, categorized by pass type:

In [219]:
# create a dataframe with start stations rentals over pass type
start_stations = bikeshare.groupby([bikeshare['start_station_id'], 
                                    bikeshare['pass_type']]).size().reset_index(name='rentals')
start_stations.head()
Out[219]:
start_station_id pass_type rentals
0 3005 Walk-up 2303
1 3005 One Day 4845
2 3005 Monthly 25338
3 3005 Flex 14
4 3005 Annual 2509

Bin the rental counts into traffic categories:

In [220]:
# use pd.cut to assign each rental count to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(start_stations['rentals'],bins)
category = category.to_frame()
category.columns = ['rental_bins']
category['pass_type'] = start_stations['pass_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'pass_type', 'rental_bins'])
category.head()
Out[220]:
start_station_id pass_type rental_bins
0 3005 Walk-up (1000, 10000]
1 3005 One Day (1000, 10000]
2 3005 Monthly (10000, 100000]
3 3005 Flex (10, 100]
4 3005 Annual (1000, 10000]
In [221]:
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Out[221]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the rental bins:

In [222]:
%%time

def label_race(df):
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
    
df = category
label_race(df)

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 37 ms
Out[222]:
Normal       335
Low          304
High         179
Very Low     120
Very High     13
Name: traffic, dtype: int64

Prepare a dataframe to categorize start stations over rental traffic and pass_type.

In [223]:
# prepare a dataframe to categorize start stations over rental traffic and pass_type
temp_df = category.groupby([category['traffic'], 
                            category['pass_type']]).count()['start_station_id'].reset_index(name='start_stations')
temp_df.head(10)
Out[223]:
traffic pass_type start_stations
0 Very Low Walk-up 3.0
1 Very Low One Day 28.0
2 Very Low Monthly 16.0
3 Very Low Flex 21.0
4 Very Low Annual 52.0
5 Low Walk-up 9.0
6 Low One Day 80.0
7 Low Monthly 63.0
8 Low Flex 7.0
9 Low Annual 145.0

Data Dashboard:

Plot the distribution of Start station traffic based on pass type:

In [224]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plot based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue


# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
sb.set_palette('GnBu', n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = df, x = 'pass_type', hue = 'traffic', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of Start stations\n\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (-0.08, 1.15), loc = 'upper left', labelspacing=0.5,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////



# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'pass_type', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Pass type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////



# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette('GnBu', n_colors=5, desat=0.8)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'pass_type', y = 'start_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('pass_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0, 1.15))

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

# Assign palette as per requirement
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui)

# plot point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'pass_type');

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Pass type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'pass_type', 'start_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);

# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.7)

# save the figure; bbox_inches='tight' adjusts the bounding box to include all of the x and y labels
plt.savefig('explanatory plots/3.12.5 Start stations rental traffic categorized by pass type.png', dpi=300, bbox_inches='tight')

Observations:

  • The plot above shows that most start stations have either Low or Very Low rental traffic from Annual pass holders. As the Annual pass is a long-term subscription, this behaviour is to be expected.
  • All start stations fall into the Low or Very Low rental traffic categories for the Flex pass. This is because the Flex pass was originally issued to employees for testing purposes, so this insight is ignored.
  • Bike rentals on the Walk-up pass are ignored, as the pass was discontinued after 2018.
  • A fair number of start stations experience Low rental traffic from Monthly pass holders. As the Monthly pass is mostly preferred by working individuals, no action is likely to persuade them to switch start stations, since doing so could delay their ride to work. Hence no action is required to increase Monthly pass rental activity at start stations with Low rental activity.
  • Many start stations have relatively Low rental activity from One Day pass holders.
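
The Low and Very Low counts behind these observations can be totalled per pass type. The sketch below uses only the ten rows shown in the `temp_df.head(10)` output earlier in this section; the frame name `low` and the column name `Low or Very Low` are introduced here for illustration.

```python
import pandas as pd

# Station counts copied from the temp_df.head(10) output
low = pd.DataFrame(
    {'Very Low': [3, 28, 16, 21, 52],
     'Low':      [9, 80, 63, 7, 145]},
    index=['Walk-up', 'One Day', 'Monthly', 'Flex', 'Annual'])

# Combined number of low-traffic start stations per pass type
low['Low or Very Low'] = low['Very Low'] + low['Low']

# Rank pass types from most to fewest low-traffic stations
ranked = low.sort_values('Low or Very Low', ascending=False)
print(ranked)
```

Once the Annual, Flex and Walk-up passes are set aside for the reasons given above, the ranking leaves One Day and Monthly as the pass types with the most low-traffic start stations.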

Reform:

  • Promotions should be announced to increase the rental traffic subjected to One Day passes at start stations with Low bike rental activity.

3.12.6 Distribution of start stations rental traffic categorized by fare type:

Obtain the number of rentals at each start station, categorized by fare type:

In [225]:
# create a dataframe with start stations rentals over fare type
start_stations = bikeshare.groupby([bikeshare['start_station_id'], 
                                    bikeshare['fare_type']]).size().reset_index(name='rentals')
start_stations.head()
Out[225]:
start_station_id fare_type rentals
0 3005 Base 32141
1 3005 Extended 2868
2 3006 Base 14653
3 3006 Extended 1210
4 3007 Base 13863

Categorize the rental traffic values into categorical sections:

In [226]:
# use pd.cut to assign each rentals value to its bin
# (named bin_edges to avoid shadowing Python's built-in bin())
bin_edges = [0,10,100,1000,10000,100000]
category = pd.cut(start_stations['rentals'], bin_edges)
category = category.to_frame()
category.columns = ['rental_bins']
category['fare_type'] = start_stations['fare_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'fare_type', 'rental_bins'])
category.head()
Out[226]:
start_station_id fare_type rental_bins
0 3005 Base (10000, 100000]
1 3005 Extended (1000, 10000]
2 3006 Base (10000, 100000]
3 3006 Extended (1000, 10000]
4 3007 Base (10000, 100000]
In [227]:
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Out[227]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the rental bins:

In [228]:
%%time

def assign_traffic(df):
    '''label each rental bin with a traffic level, from the lowest bin to the highest'''
    bins = df.rental_bins.sort_values(ascending=True).unique()
    levels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
    for rental_bin, level in zip(bins, levels):
        df.loc[df['rental_bins'] == rental_bin, 'traffic'] = level
    
df = category
assign_traffic(df)

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 39 ms
Out[228]:
Normal       209
Low          146
High         122
Very Low      38
Very High     25
Name: traffic, dtype: int64
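
The binning and labelling steps above can also be collapsed into a single call, since `pd.cut` accepts a `labels` argument and returns an ordered categorical. The following is a minimal sketch of that alternative, not the approach used in this notebook; the `toy` frame and its values are made up for illustration and stand in for `start_stations`.

```python
import pandas as pd

# illustrative stand-in for the start_stations frame (made-up values)
toy = pd.DataFrame({'start_station_id': [1, 2, 3, 4, 5],
                    'rentals': [7, 85, 640, 2868, 32141]})

bin_edges = [0, 10, 100, 1000, 10000, 100000]
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# labels= names each right-closed interval; the result is an ordered categorical,
# so no separate CategoricalDtype conversion is needed
toy['traffic'] = pd.cut(toy['rentals'], bins=bin_edges, labels=level_order)
print(toy)
```

Because the labels are attached to the bin edges rather than to the bins that happen to occur in the data, this variant would still label correctly if one of the five bins were empty.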

Prepare a dataframe to categorize start stations over rental traffic and fare_type.

In [229]:
# prepare a dataframe to categorize start stations over rental traffic and fare_type
temp_df = category.groupby([category['traffic'], 
                            category['fare_type']]).count()['start_station_id'].reset_index(name='start_stations')
temp_df.head(10)
Out[229]:
traffic fare_type start_stations
0 Very Low Base 12
1 Very Low Extended 26
2 Low Base 53
3 Low Extended 93
4 Normal Base 101
5 Normal Extended 108
6 High Base 84
7 High Extended 38
8 Very High Base 23
9 Very High Extended 2

Data Dashboard:

Plot the distribution of Start station traffic based on fare type:

In [230]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
            

# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
sb.set_palette('GnBu', n_colors = 5, desat = 0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'fare_type', hue = 'traffic', alpha = 0.8, saturation = 0.8)

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.85, 0.8), loc = 6, labelspacing=0.5,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

separators = [0.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
flatui = ['#fcd605', '#fae887']
sb.set_palette(flatui, desat = 0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'fare_type', alpha = 0.8, saturation = 0.8)

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Fare type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////


# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'fare_type', y = 'start_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)

# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('fare_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))

sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

# Assign palette as per requirement
flatui = ['#fcd605', '#fae887']
sb.set_palette(flatui, desat = 0.6)

# Seaborn's pointplot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'fare_type');

# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Fare type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'fare_type', 'end_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.7)

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.6 Start stations rental traffic categorized by fare type.png', dpi=300, bbox_inches='tight')

Observations:

  • The above plot shows that a large share of start stations experience Low rental traffic on Extended fares: of the 267 station entries with Extended-fare rentals, 93 fall in the Low and 26 in the Very Low category (about 45% combined). Action should be taken to encourage customers to take longer trips at these stations to increase income generation.
  • In contrast, most start stations experience Normal or higher rental traffic on the Base fare (208 of 273 entries, about 76%). This is a good sign for a healthy business model with a higher average bike rental traffic over time.

Reform:

  • Promotions should be announced to encourage customers to take longer trips at the stations with Low bike rental traffic subjected to Extended Fares, to increase income generation.
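
The shares behind these observations can be checked directly from the counts in the Out[229] table. The sketch below re-enters those counts by hand and computes, for each fare type, the percentage of station entries in each traffic level; the variable names `wide`, `share`, `low_extended` and `normal_plus_base` are introduced here for illustration only.

```python
import pandas as pd

# counts copied from the Out[229] table above
temp_df = pd.DataFrame({
    'traffic': ['Very Low', 'Very Low', 'Low', 'Low', 'Normal', 'Normal',
                'High', 'High', 'Very High', 'Very High'],
    'fare_type': ['Base', 'Extended'] * 5,
    'start_stations': [12, 26, 53, 93, 101, 108, 84, 38, 23, 2],
})

# wide table: one row per traffic level, one column per fare type
wide = temp_df.pivot(index='traffic', columns='fare_type', values='start_stations')

# percentage of each fare type's station entries that fall in each traffic level
share = wide / wide.sum() * 100

low_extended = share.loc[['Very Low', 'Low'], 'Extended'].sum()
normal_plus_base = share.loc[['Normal', 'High', 'Very High'], 'Base'].sum()

print(share.round(1))
print(round(low_extended, 1), round(normal_plus_base, 1))
```

Note that each row of the underlying data is a station and fare type pair, so a station with both Base and Extended rentals is counted once in each column.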

----------------------------------------------

3.12.7 Insights:

  1. Bike rental traffic is not equally distributed over the start stations. However, this does not imply that the low-traffic start stations should be eliminated, as they might see good bike return traffic and still deliver acceptable business metrics.
  2. A large number of stations experience Low and Very Low rental traffic for Round Trips. This reveals the need to improve Round Trip rental traffic at these start stations.
  3. Fewer start stations experience High and Very High bike rental traffic for Round Trips than for One Way trips. This indicates that One Way trips are more popular among customers.
  4. Across bike types, only a very small number of start stations experience Very Low bike rental activity, which is a good sign for a healthy business model. However, the number of start stations that experience Very High rental activity is also very small, which keeps the start stations from serving their full potential.
  5. As the numbers of start stations with Low and Very Low bike rental activity are clustered closely across bike types, bike type appears to have no influence on the rental activity at these stations, and other factors might have caused the decrease in bike rental activity at these specific stations. So no bike-type-specific action is required to increase the bike rental activity at start stations with Low and Very Low rental activity.
  6. Smart bikes have fewer start stations with Normal and High rental traffic than other bike types. This suggests that Smart bikes require more advertisement and awareness among customers.
  7. Most start stations see either Low or Very Low rental traffic on the Annual pass. As the Annual pass is a long-term subscription, this behaviour is to be expected.
  8. All start stations with Flex pass rentals fall into the Low and Very Low traffic categories. This is because the Flex pass was originally issued to employees for testing purposes. Hence this insight is ignored.
  9. A fair number of start stations experience Low rental traffic on the Monthly pass. As the Monthly pass is mostly preferred by working individuals, no action is likely to make them switch their preferred start station, since doing so might delay their ride to work. Hence no action is required to increase Monthly pass rental activity at start stations with Low rental activity.
  10. Many start stations have relatively Low bike rental activity on the One Day pass. This might be due to their geographical location, or because rentals at these stations are mostly made with other pass types.
  11. A large share of start stations experience Low rental traffic on Extended fares. Action should be taken to encourage customers to take longer trips at these stations to increase income generation.
  12. Most start stations experience Normal or higher rental traffic on the Base fare. This is a good sign for a healthy business model with a higher average bike rental traffic over time.

3.12.8 Reforms proposed:

  1. Discounts or promotional activities should be announced for Round Trips to encourage customers to rent bikes from the stations with Low and Very Low Round Trip rental activity.
  2. Announce promotions on Smart bikes to increase their rental activity at start stations.
  3. Promotions should be announced to increase the rental traffic subjected to One Day passes at start stations with Low bike rental activity.
  4. Promotions should be announced to encourage customers to take longer trips at the stations with Low bike rental traffic subjected to Extended Fares, to increase income generation.

3.13 Is bike return traffic equally distributed over the end stations? If not, how can the end stations be better optimized to increase their bike return activity?

  • Column: end_station_id
  • Data type: categorical, ordinal
  • Plot : Distribution plot, pie chart, bar chart

As bike rentals and bike returns follow a linear relation, this is taken into account throughout the end stations analysis.
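
One way to check the linear relation between rentals and returns is to count both per station and compute their Pearson correlation. The sketch below assumes only the `start_station_id` and `end_station_id` columns used in this notebook; the `trips` frame is made up for illustration, so the resulting coefficient says nothing about the real dataset.

```python
import pandas as pd

# made-up trips; only the two station columns are needed
trips = pd.DataFrame({
    'start_station_id': [3005, 3005, 3005, 3006, 3006, 3007, 3005, 3006],
    'end_station_id':   [3005, 3006, 3005, 3006, 3005, 3007, 3005, 3007],
})

rentals = trips.groupby('start_station_id').size().rename('rentals')
returns = trips.groupby('end_station_id').size().rename('returns')

# align the two counts on station id; a station missing on one side gets 0
per_station = pd.concat([rentals, returns], axis=1).fillna(0).astype(int)
per_station.index.name = 'station_id'

# a Pearson coefficient close to 1 would support a linear relation
r = per_station['rentals'].corr(per_station['returns'])
print(per_station)
print(round(r, 3))
```

On the real data, the same steps would be applied to `bikeshare` in place of `trips`.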

3.13.1 Logarithmic distribution of end_stations bike returns:

Calculate the respective bike returns subjected to each end station.

In [231]:
# find the bike returns based on end_station_id
end_stations = bikeshare.groupby([bikeshare['end_station_id']]).size().reset_index(name='returns')
end_stations.head()
Out[231]:
end_station_id returns
0 3005 38639
1 3006 16430
2 3007 11910
3 3008 11912
4 3009 68

Explore the Logarithmic distribution of end stations bike returns:

In [232]:
def log_trans(x, inverse = False):
    if not inverse:
        return np.log10(x)
    else:
        return 10 ** x


sb.set_style('white')

# prepare the data for the plot
min_value = log_trans(end_stations['returns'].min())
max_value = log_trans(end_stations['returns'].max())
bin_edges = np.arange(0, max_value+0.1, 0.1)

# matplotlib's histogram
plt.hist(end_stations['returns'].apply(log_trans), bins = bin_edges, color = 'salmon')

# improve plot aesthetics
plt.title('Logarithmic distribution of end stations bike returns\n', fontsize = 14, weight = 'bold')
plt.xlabel('\nNumber of bike returns', fontsize = 12)
plt.ylabel('Number of End stations\n', fontsize = 12);
tick_locs = np.arange(0, max_value+1, 1)
plt.xticks(tick_locs, log_trans(tick_locs, inverse = True).astype(int), fontsize = 10)
plt.yticks(fontsize = 10)

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.1 Logarithmic distribution of end stations bike returns.png', dpi=300, bbox_inches='tight')
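
The histogram above transforms the data to log10 and then relabels the ticks. An alternative, sketched here with NumPy only and made-up return counts, is to build the same bin edges in the original units as powers of ten. The `returns` array is hypothetical and is not taken from the dataset.

```python
import numpy as np

# made-up return counts spanning several orders of magnitude
returns = np.array([3, 8, 45, 68, 120, 950, 4200, 11910, 16430, 38639])

# edges at 10**0, 10**0.1, ... mirror np.arange(0, max_value+0.1, 0.1) on the log10 scale
max_log = np.log10(returns.max())
bin_edges = 10 ** np.arange(0, max_log + 0.1, 0.1)

# count how many stations fall into each log-spaced bin
counts, edges = np.histogram(returns, bins=bin_edges)
print(counts.sum(), len(bin_edges))
```

The same `bin_edges` can be passed to `plt.hist` on the untransformed values together with `plt.xscale('log')`, which keeps the axis in real units without manual tick relabelling.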

Break down the bike return traffic at the end stations based on the above plot.

3.13.2 Classification of end_stations based on their bike return traffic:

Create a dataframe based on bike return traffic and number of end stations associated with them.

In [233]:
returns = {'return_traffic' : pd.Series(['Very Low', 'Low', 'Normal', 'High', 'Very High']), 
           'end_stations' : pd.Series([end_stations.query(' returns < 10 ').shape[0], 
                                       end_stations.query(' returns >= 10 and returns < 100 ').shape[0],
                                       end_stations.query(' returns >= 100 and returns < 1000 ').shape[0], 
                                       end_stations.query(' returns >= 1000 and returns < 10000 ').shape[0], 
                                       end_stations.query(' returns >= 10000 ').shape[0]])} 
  
# create Dataframe. 
bike_returns = pd.DataFrame(returns)
bike_returns
Out[233]:
return_traffic end_stations
0 Very Low 10
1 Low 43
2 Normal 108
3 High 91
4 Very High 26
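
The five `query` calls above use left-closed ranges (for example `returns >= 10 and returns < 100`), whereas the `pd.cut` calls in the later subsections use the default right-closed bins such as `(10, 100]`, so a station with exactly 10 returns would be classed differently. The sketch below shows how the same left-closed classification could be obtained with `pd.cut(..., right=False)`; the `end_stations` values are made up to sit on the bin boundaries.

```python
import numpy as np
import pandas as pd

# made-up per-station return counts, chosen to sit on the bin boundaries
end_stations = pd.DataFrame({'end_station_id': [1, 2, 3, 4, 5, 6],
                             'returns': [9, 10, 99, 100, 9999, 10000]})

edges = [0, 10, 100, 1000, 10000, np.inf]
labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# right=False makes the bins left-closed: [0, 10), [10, 100), ...
end_stations['return_traffic'] = pd.cut(end_stations['returns'], bins=edges,
                                        labels=labels, right=False)

# count stations per level, ordered from 'Very Low' to 'Very High'
bike_returns = (end_stations['return_traffic']
                .value_counts()
                .sort_index()
                .rename_axis('return_traffic')
                .reset_index(name='end_stations'))
print(bike_returns)
```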

Plot the distribution of end stations bike return traffic.

In [234]:
def absolute_value(val):
    '''returns absolute count of end stations to plot in the 
    pie chart as annotations using the autopct function'''
    a  = np.round(val/100.*type_level_counts.sum(), 0)
    return int(a)


# Assign grid and figure size
plt.figure(figsize = [12, 5])
sb.set_style('white')


# left plot: Pie chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 1)

# prepare the data for the plot
type_level_counts = bike_returns.end_stations.values
type_level_index = bike_returns.return_traffic.values
explode = (0.2, 0, 0, 0, 0)
colors = ['bisque', 'salmon', 'salmon', 'salmon', 'salmon']

# matplotlib's pie chart
plt.pie(type_level_counts, labels = type_level_index, startangle = 90,
        counterclock = False, wedgeprops = {'width' : 0.4}, shadow=False, 
        explode=explode, colors=colors, textprops={'fontsize': 12}, 
        autopct='%1.0f%%', labeldistance=1.1, pctdistance=0.8)
plt.title('Percent of Stations\n\n', fontsize = 14, weight = 'bold', color = 'grey')
plt.axis('square');
# =====================================================
# /////////////////////////////////////////////////////


# right plot: Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 2)

# Assign grid and color palette as per requirement
base_color = sb.color_palette()[0]

# prepare the data for the plot
counts = bike_returns.end_stations.values
order = bike_returns.end_stations.index
y_locs = [0, 1, 2, 3, 4]
y_labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
clrs = [ 'salmon' if (count > bike_returns.end_stations.values.min()) else 'bisque' for count in counts ]

# Seaborn's bar chart
sb.barplot(x = counts, y = order, palette=clrs, alpha= 1, saturation = 0.8, orient = 'h')

# improve plot aesthetics
plt.title('Number of Stations\n\n', weight = 'bold', fontsize = 16, color = 'grey')
plt.yticks(y_locs, y_labels, rotation = 0, fontsize = 12)
plt.xticks([], [], rotation = 0, fontsize = 12) 
plt.xlabel('', fontsize = 14)
# plt.ylabel('Number of Stations', fontsize = 14)

# add annotations
# -------------------------------------------------------
# loop through each pair of locations and labels
for loc, count in zip(y_locs, counts):
    pct_string = '{:0.0f}'.format(count)
    
    # print the annotation based on bar length
    if count <= int(max(counts)/10):
        plt.text(count+int(max(counts)/20), loc, pct_string, ha = 'center', color = 'black', fontsize = 13)
    else:
        plt.text(count-int(max(counts)/10), loc, pct_string, ha = 'center', color = 'white', fontsize = 13)
# ------------------------------------------------------- 

sb.despine(fig=None, ax=None, top=True, right=True, left=True, bottom=True, offset=None, trim=False);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Classification of End stations based on Bike Return Traffic\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.2 Classification of End stations based on Return Traffic.png', dpi=300, bbox_inches='tight')

Observations:

  • The above plot shows that some end stations have very low bike return activity. This indicates that bike return traffic is not equally distributed over the end stations.

  • However, this does not imply that these end stations should be eliminated, as they might see good bike rental traffic and still deliver acceptable business metrics.

Deeper exploration of end stations return behaviour for hidden insights:

3.13.3 Explore the distribution of end stations return traffic categorized by trip type:

Obtain the bike returns subjected to each end station categorized over trip type:

In [235]:
# create a dataframe with end stations returns over trip type
end_stations = bikeshare.groupby([bikeshare['end_station_id'], 
                                  bikeshare['trip_type']]).size().reset_index(name='returns')
end_stations.head()
Out[235]:
end_station_id trip_type returns
0 3005 One Way 35701
1 3005 Round Trip 2938
2 3006 One Way 14738
3 3006 Round Trip 1692
4 3007 One Way 11017

Categorize the return traffic values into categorical sections:

In [236]:
# use pd.cut to assign each returns value to its bin
# (named bin_edges to avoid shadowing Python's built-in bin())
bin_edges = [0,10,100,1000,10000,100000]
category = pd.cut(end_stations['returns'], bin_edges)
category = category.to_frame()
category.columns = ['return_bins']
category['trip_type'] = end_stations['trip_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'trip_type', 'return_bins'])
category.head()
Out[236]:
end_station_id trip_type return_bins
0 3005 One Way (10000, 100000]
1 3005 Round Trip (1000, 10000]
2 3006 One Way (10000, 100000]
3 3006 Round Trip (1000, 10000]
4 3007 One Way (10000, 100000]
In [237]:
# obtain the unique categorical return bins
category.return_bins.sort_values(ascending=True).unique()
Out[237]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the return bins:

In [238]:
%%time

def assign_traffic(df):
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
    
df = category
assign_traffic(df)

# convert the 'traffic' column to ordered categorical datatype
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 44 ms
Out[238]:
Normal       216
Low          154
High         123
Very Low      32
Very High     21
Name: traffic, dtype: int64

Prepare a dataframe to categorize end stations over bike return traffic and trip_type:

In [239]:
# prepare a dataframe to categorize end stations over return traffic and trip_type
temp_df = category.groupby([category['traffic'], category['trip_type']]).size().reset_index(name='end_stations')
temp_df
Out[239]:
traffic trip_type end_stations
0 Very Low One Way 10
1 Very Low Round Trip 22
2 Low One Way 58
3 Low Round Trip 96
4 Normal One Way 105
5 Normal Round Trip 111
6 High One Way 82
7 High Round Trip 41
8 Very High One Way 21
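
Out[239] has nine rows rather than ten because no end station reaches Very High return traffic for Round Trips, so that combination is simply absent. If a zero row is preferred over a missing one, the grouped counts could be reindexed against the full grid of combinations. This is a minimal sketch with a made-up `category` frame; `full_grid` and `observed` are names introduced here for illustration.

```python
import pandas as pd

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
trip_types = ['One Way', 'Round Trip']

# made-up station rows; no station has 'Very High' Round Trip traffic
category = pd.DataFrame({
    'traffic':   ['Very Low', 'Low', 'Low', 'Normal', 'High', 'Very High'],
    'trip_type': ['Round Trip', 'One Way', 'Round Trip', 'One Way', 'Round Trip', 'One Way'],
})

# counts for the combinations that actually occur
observed = category.groupby(['traffic', 'trip_type']).size()

# every traffic level paired with every trip type, in display order
full_grid = pd.MultiIndex.from_product([level_order, trip_types],
                                       names=['traffic', 'trip_type'])

# combinations that never occur get an explicit count of 0
temp_df = observed.reindex(full_grid, fill_value=0).reset_index(name='end_stations')
print(temp_df)
```

With every combination present, code that indexes the counts by position no longer needs to guard against a shorter array for one of the groups.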

Data Dashboard:

Plot the distribution of End station return traffic based on trip type:

In [240]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue


# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'trip_type', hue = 'traffic', alpha = 0.8, saturation = 0.9)

# improve plot aesthetics
plt.title('Distribution of End stations return traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.8), loc = 6, labelspacing=0.5,  
           title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

separators = [0.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
flatui = ['salmon', 'bisque']
sb.set_palette(flatui, desat = 0.8)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'trip_type', alpha = 0.8, saturation = 0.9)

# improve plot aesthetics
plt.title('Classification of End stations by return traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nReturn traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.8), loc = 6, labelspacing=0.5,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////


# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
# flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
# sb.set_palette(flatui, n_colors=5, desat=0.6)
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'trip_type', y = 'end_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of End stations return traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)

# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('trip_type', 'traffic', 'end_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))

sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

# Assign color palette
flatui = ['salmon', 'bisque']
sb.set_palette(flatui, desat = 0.8)

# Seaborn's pointplot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'trip_type');

# improve plot aesthetics
plt.title('Classification of End stations by return traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nReturn traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'trip_type', 'start_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.3, hspace=0.7)

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.3 End stations return traffic categorized by trip type.png', dpi=300, bbox_inches='tight')

Observations:

  • Many end stations receiving One Way trips experience Low bike return traffic. For Round Trips, the bike rental and the bike return happen at the same station, so Low return traffic at a station also implies Low rental traffic at that station. For One Way trips, however, Low return traffic creates a deficit in the availability of bikes at that station. Hence immediate action is required to redirect bikes from stations with High and Very High return traffic to stations with Low and Very Low return traffic on One Way trips, to even out the availability of bikes across all stations.
  • A large number of end stations receiving Round Trips also experience Low bike return traffic.

Reform:

  • Promotions should be announced to encourage customers to opt for the Round Trips at the end stations with Low bike return traffic.
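
The redirection proposed above needs a per-station measure of the imbalance. The following is a minimal sketch, not the method used in this notebook. It assumes the trips table exposes `start_station_id`, `end_station_id` and `trip_type` columns and that One Way trips are labelled 'One Way'; the toy rows are made up for illustration. The net flow of a station is its returns minus its rentals, and a negative value marks a station that loses bikes.

```python
import pandas as pd

# Toy trips table; column names and the 'One Way' label are assumptions
trips = pd.DataFrame({
    'start_station_id': [3005, 3005, 3005, 3006, 3007, 3007],
    'end_station_id':   [3006, 3006, 3007, 3006, 3005, 3006],
    'trip_type': ['One Way', 'One Way', 'One Way',
                  'Round Trip', 'One Way', 'One Way'],
})

# Round Trips start and end at the same station, so only One Way trips
# can change the number of bikes docked at a station
one_way = trips[trips['trip_type'] == 'One Way']
rentals = one_way.groupby('start_station_id').size()
returns = one_way.groupby('end_station_id').size()

# net flow = returns - rentals; stations missing on one side count as 0
net_flow = returns.sub(rentals, fill_value=0).astype(int).sort_values()
print(net_flow)
```

Stations at the top of `net_flow` (most negative) are candidates to receive bikes, and stations at the bottom are candidates to give bikes away.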

3.13.4 Explore the distribution of end stations rental traffic categorized by bike type:

Obtain the number of returns at each end station, categorized by bike type:

In [241]:
# create a dataframe of end-station returns by bike type
end_stations = bikeshare.groupby([bikeshare['end_station_id'], 
                                  bikeshare['bike_type']]).size().reset_index(name='returns')
end_stations.head()
Out[241]:
   end_station_id  bike_type  returns
0            3005    unknown    19990
1            3005   Standard    12296
2            3005   Electric     6353
3            3006    unknown     7942
4            3006   Standard     5004

Categorize the rental traffic values into categorical sections:

In [242]:
# use pd.cut to assign each return count to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(end_stations['returns'], bins)
category = category.to_frame()
category.columns = ['return_bins']
category['bike_type'] = end_stations['bike_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'bike_type', 'return_bins'])
category.head()
Out[242]:
   end_station_id  bike_type      return_bins
0            3005    unknown  (10000, 100000]
1            3005   Standard  (10000, 100000]
2            3005   Electric    (1000, 10000]
3            3006    unknown    (1000, 10000]
4            3006   Standard    (1000, 10000]
In [243]:
# obtain the unique categorical return bins
category.return_bins.sort_values(ascending=True).unique()
Out[243]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the return bins:

In [244]:
%%time

def assign_traffic(df):
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
    
df = category
assign_traffic(df)

# convert 'traffic' column to ordered categorical datatype
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 31.2 ms
Out[244]:
Normal       215
High         180
Low           87
Very Low      22
Very High     14
Name: traffic, dtype: int64
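
The binning and labelling steps above can also be done in one call, because `pd.cut` accepts a `labels` argument and returns an ordered categorical by default. A small sketch of that alternative, using the same bin edges; the counts 6353 and 19990 come from Out[241], the three smaller counts are made up to hit the lower bins:

```python
import pandas as pd

# return counts per (station, bike type); the first three are illustrative
returns = pd.Series([4, 57, 640, 6353, 19990], name='returns')

bins = [0, 10, 100, 1000, 10000, 100000]
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# one label per bin, so there must be len(bins) - 1 labels
traffic = pd.cut(returns, bins=bins, labels=level_order)

print(traffic.tolist())
print(traffic.cat.ordered)
```

With this form there is no need for a separate `return_bins` column or a later conversion with `CategoricalDtype`.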

Prepare a dataframe to categorize end stations over return traffic and bike_type:

In [245]:
# prepare a dataframe to categorize end stations over return traffic and bike_type
temp_df = category.groupby([category['traffic'], category['bike_type']]).size().reset_index(name='end_stations')
temp_df
Out[245]:
      traffic  bike_type  end_stations
0    Very Low    unknown             1
1    Very Low   Standard             5
2    Very Low   Electric             9
3    Very Low      Smart             7
4         Low    unknown             5
5         Low   Standard            32
6         Low   Electric            27
7         Low      Smart            23
8      Normal    unknown            33
9      Normal   Standard            66
10     Normal   Electric            69
11     Normal      Smart            47
12       High    unknown            86
13       High   Standard            52
14       High   Electric            37
15       High      Smart             5
16  Very High    unknown            11
17  Very High   Standard             3
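
The long table above has no rows for Very High traffic with Electric or Smart bikes, which is easy to miss. A sketch of a wide view that makes the empty combinations explicit, rebuilt here from the counts in Out[245] so that it runs on its own (in the notebook, `temp_df` could be pivoted directly):

```python
import pandas as pd

# counts copied from Out[245]
rows = [
    ('Very Low', 'unknown', 1), ('Very Low', 'Standard', 5),
    ('Very Low', 'Electric', 9), ('Very Low', 'Smart', 7),
    ('Low', 'unknown', 5), ('Low', 'Standard', 32),
    ('Low', 'Electric', 27), ('Low', 'Smart', 23),
    ('Normal', 'unknown', 33), ('Normal', 'Standard', 66),
    ('Normal', 'Electric', 69), ('Normal', 'Smart', 47),
    ('High', 'unknown', 86), ('High', 'Standard', 52),
    ('High', 'Electric', 37), ('High', 'Smart', 5),
    ('Very High', 'unknown', 11), ('Very High', 'Standard', 3),
]
temp_df = pd.DataFrame(rows, columns=['traffic', 'bike_type', 'end_stations'])
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# one row per traffic level, one column per bike type; missing pairs become 0
wide = (temp_df.pivot(index='traffic', columns='bike_type', values='end_stations')
               .reindex(level_order)
               .fillna(0)
               .astype(int))
print(wide)
```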

Data Dashboard:

Plot the distribution of End station traffic based on bike type:

In [246]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
            

# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'bike_type', hue = 'traffic', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Distribution of End stations rental traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of End stations\n\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
# -------------------------------------------------------

separators = [0.5, 1.5, 2.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'bike_type', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Classification of End stations by rental traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nReturn traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
# -------------------------------------------------------

separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'bike_type', y = 'end_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of End stations rental traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('pass_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.8, 1.15))

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)

# plot point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'bike_type');

# improve plot aesthetics
plt.title('Classification of End stations by rental traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'pass_type', 'start_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.3, hspace=0.7)

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.4 End stations bike return traffic categorized by bike type.png', dpi=300, bbox_inches='tight')
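
The bar annotation loop is repeated for every clustered bar chart in these dashboards. A sketch of how it could be pulled into one helper; `annotate_bars` is a hypothetical name, not something defined elsewhere in this notebook, and the demo uses a plain matplotlib bar chart with made-up counts instead of a seaborn countplot. Skipping NaN heights is a defensive guard added here.

```python
import math

import matplotlib.pyplot as plt


def annotate_bars(ax, offset=10, fontsize=12):
    """Write the height of every bar in `ax` just above the bar.

    Returns the label strings in drawing order. Bars whose height is
    NaN are skipped.
    """
    labels = []
    for p in ax.patches:
        height = p.get_height()
        if math.isnan(height):
            continue
        label = format(height, '.0f')
        ax.annotate(label, (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='center', xytext=(0, offset),
                    textcoords='offset points', fontsize=fontsize)
        labels.append(label)
    return labels


# usage on a small bar chart with made-up counts
fig, ax = plt.subplots()
ax.bar(['A', 'B'], [20, 2])
bar_labels = annotate_bars(ax)
plt.close(fig)
print(bar_labels)
```

In the cells above, the call would be `annotate_bars(g)` right after each `sb.countplot(...)`.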

Observation:

The plot above shows that a number of end stations experience Low and Very Low bike return activity, and that the counts for the different bike types are clustered closely together at these levels. This suggests that bike type has little influence on the return activity at these stations, because bike returns are correlated with bike rentals. Hence no bike-type-specific action is required to increase the return activity at end stations with Low and Very Low bike return traffic.

Reform:

Smart bikes have fewer end stations with Normal and High bike return traffic than the other bike types. This is because Smart bikes also have fewer start stations with Normal and High bike rental traffic than the other bike types. However, some stations might receive fewer Smart bike returns than rentals, which would result in a Smart bike deficit at those particular stations. Hence Smart bike returns should be adjusted or redirected to match the bike demand at those end stations.
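
Both statements above can be checked numerically once rentals and returns are counted per station for one bike type. A minimal sketch under stated assumptions: the `per_station` table and all of its numbers are made up, and in the notebook the two columns would come from grouping `bikeshare` by start station and by end station.

```python
import pandas as pd

# hypothetical per-station counts for a single bike type
per_station = pd.DataFrame({
    'station_id': [3005, 3006, 3007, 3008],
    'rentals':    [120, 40, 15, 300],
    'returns':    [110, 55, 10, 290],
}).set_index('station_id')

# Pearson correlation between rentals and returns across stations
corr = per_station['rentals'].corr(per_station['returns'])

# stations that receive fewer bikes than they hand out
deficit = per_station[per_station['returns'] < per_station['rentals']]

print(round(corr, 3))
print(deficit.index.tolist())
```

A correlation close to 1 supports the claim that returns follow rentals, while the `deficit` rows are the stations where redirected returns would help.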

3.13.5 Distribution of end stations bike return traffic categorized by pass type:

Obtain the number of returns at each end station, categorized by pass type:

In [247]:
# create a dataframe with end stations returns over pass type
end_stations = bikeshare.groupby([bikeshare['end_station_id'], 
                                  bikeshare['pass_type']]).size().reset_index(name='returns')
end_stations.head()
Out[247]:
   end_station_id  pass_type  returns
0            3005    Walk-up     2462
1            3005    One Day     5152
2            3005    Monthly    28014
3            3005       Flex       19
4            3005     Annual     2992

Categorize the rental traffic values into categorical sections:

In [248]:
# use pd.cut to assign each return count to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(end_stations['returns'], bins)
category = category.to_frame()
category.columns = ['return_bins']
category['pass_type'] = end_stations['pass_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'pass_type', 'return_bins'])
category.head()
Out[248]:
   end_station_id  pass_type      return_bins
0            3005    Walk-up    (1000, 10000]
1            3005    One Day    (1000, 10000]
2            3005    Monthly  (10000, 100000]
3            3005       Flex        (10, 100]
4            3005     Annual    (1000, 10000]
In [249]:
# obtain the unique categorical return bins
category.return_bins.sort_values(ascending=True).unique()
Out[249]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the return bins:

In [250]:
%%time

def assign_traffic(df):
    '''label each return bin with its traffic level'''
    # sorted unique bins, lowest interval first
    return_bins = df.return_bins.sort_values(ascending=True).unique()
    labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
    for return_bin, label in zip(return_bins, labels):
        df.loc[df['return_bins'] == return_bin, 'traffic'] = label

df = category
assign_traffic(df)

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 59 ms
Out[250]:
Normal       344
Low          317
High         170
Very Low     124
Very High     11
Name: traffic, dtype: int64
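
The group, bin and label steps are identical for bike type, pass type and fare type, so they could be wrapped in a single function. A sketch of such a helper; `station_traffic` is a hypothetical name, and the toy trips below are made up. It uses `pd.cut` with `labels`, which differs slightly from the two-step labelling in the cells above.

```python
import pandas as pd

BINS = [0, 10, 100, 1000, 10000, 100000]
LEVELS = ['Very Low', 'Low', 'Normal', 'High', 'Very High']


def station_traffic(trips, by, station_col='end_station_id'):
    """Count trips per (station, `by`) pair and label each count
    with its traffic level."""
    counts = trips.groupby([station_col, by]).size().reset_index(name='returns')
    # ordered categorical, Very Low < ... < Very High
    counts['traffic'] = pd.cut(counts['returns'], bins=BINS, labels=LEVELS)
    return counts


# toy data: 12 Monthly and 3 Flex returns at 3005, 2 Walk-up returns at 3006
trips = pd.DataFrame({
    'end_station_id': [3005] * 15 + [3006] * 2,
    'pass_type': ['Monthly'] * 12 + ['Flex'] * 3 + ['Walk-up'] * 2,
})
out = station_traffic(trips, 'pass_type')
print(out)
```

The same call with `by='bike_type'` or `by='fare_type'` would replace the corresponding cells in the other subsections.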

Prepare a dataframe to categorize end stations over return traffic and pass_type.

In [251]:
# prepare a dataframe to categorize end stations over return traffic and pass_type
temp_df = category.groupby([category['traffic'], 
                            category['pass_type']]).size().reset_index(name='end_stations')
temp_df.head(10)
Out[251]:
    traffic  pass_type  end_stations
0  Very Low    Walk-up             3
1  Very Low    One Day            30
2  Very Low    Monthly            17
3  Very Low       Flex            21
4  Very Low     Annual            53
5       Low    Walk-up            11
6       Low    One Day            91
7       Low    Monthly            66
8       Low       Flex             6
9       Low     Annual           143

Data Dashboard:

Plot the distribution of End station traffic based on pass type:

In [252]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plot based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue


# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = df, x = 'pass_type', hue = 'traffic', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Distribution of End stations return traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of End stations\n\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.9, 1.15), loc = 'upper left', labelspacing=0.5,  
           title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////



# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
flatui = ['indianred', 'lightcoral', 'darksalmon', 'salmon', 'lightsalmon']
sb.set_palette('deep', n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'pass_type', alpha = 0.8, saturation = 1)

# improve plot aesthetics
plt.title('Classification of End stations by return traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Pass type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////



# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'pass_type', y = 'end_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of End stations return traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of end stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('pass_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.9, 1.15))

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

# Assign palette as per requirement
flatui = ['indianred', 'lightcoral', 'darksalmon', 'salmon', 'lightsalmon']
sb.set_palette('deep', n_colors=5, desat=0.6)

# plot point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'pass_type');

# improve plot aesthetics
plt.title('Classification of End stations by return traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Pass type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'pass_type', 'start_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);

# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.7)

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.5 End stations bike return traffic categorized by pass type.png', dpi=300, bbox_inches='tight')

Observations:

  • The Annual pass has a high number of end stations with Low and Very Low return traffic. This is because the Annual pass also has a high number of stations with Low and Very Low rental activity.
  • Many end stations have relatively Low bike return activity for the One Day pass.

Reform:

  • As many end stations have relatively Low bike return activity for the One Day pass, bike returns should be redirected from other stations to match the demand for bikes at these end stations.
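
To act on this reform, the affected stations have to be listed. Because `traffic` is an ordered categorical, it can be compared against a level directly. A minimal sketch with a toy `category` frame; the rows and the station ids 3007 and 3008 are made up for illustration:

```python
import pandas as pd

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered=True, categories=level_order)

# toy stand-in for the `category` dataframe built above
category = pd.DataFrame({
    'end_station_id': [3005, 3006, 3007, 3008],
    'pass_type': ['One Day', 'One Day', 'Annual', 'One Day'],
    'traffic': ['High', 'Low', 'Very Low', 'Very Low'],
})
category['traffic'] = category['traffic'].astype(ordered_cat)

# ordered categoricals support <=, so 'Low' and 'Very Low' are both selected
low_one_day = category[(category['pass_type'] == 'One Day') &
                       (category['traffic'] <= 'Low')]
print(low_one_day['end_station_id'].tolist())
```

The comparison only works because the dtype is ordered; on a plain string column `<= 'Low'` would compare alphabetically.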

3.13.6 Distribution of End stations bike return traffic categorized by fare type:

Obtain the number of returns at each end station, categorized by fare type:

In [253]:
# create a dataframe of end-station returns by fare type
end_stations = bikeshare.groupby([bikeshare['end_station_id'], 
                                  bikeshare['fare_type']]).size().reset_index(name='returns')
end_stations.head()
Out[253]:
   end_station_id  fare_type  returns
0            3005       Base    35569
1            3005   Extended     3070
2            3006       Base    15186
3            3006   Extended     1244
4            3007       Base    10932

Categorize the return traffic values into categorical sections:

In [254]:
# use pd.cut to assign each return count to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(end_stations['returns'], bins)
category = category.to_frame()
category.columns = ['return_bins']
category['fare_type'] = end_stations['fare_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'fare_type', 'return_bins'])
category.head()
Out[254]:
   end_station_id  fare_type      return_bins
0            3005       Base  (10000, 100000]
1            3005   Extended    (1000, 10000]
2            3006       Base  (10000, 100000]
3            3006   Extended    (1000, 10000]
4            3007       Base  (10000, 100000]
In [255]:
# obtain the unique categorical return bins
category.return_bins.sort_values(ascending=True).unique()
Out[255]:
[(0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, 100000]]
Categories (5, interval[int64]): [(0, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000]]

Label the return bins:

In [256]:
%%time

def assign_traffic(df):
    '''label each return bin with its traffic level'''
    # sorted unique bins, lowest interval first
    return_bins = df.return_bins.sort_values(ascending=True).unique()
    labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
    for return_bin, label in zip(return_bins, labels):
        df.loc[df['return_bins'] == return_bin, 'traffic'] = label

df = category
assign_traffic(df)

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)

category.traffic.value_counts()
Wall time: 39 ms
Out[256]:
Normal       210
Low          150
High         125
Very Low      42
Very High     22
Name: traffic, dtype: int64

Prepare a dataframe to categorize end stations over return traffic and fare_type.

In [257]:
# prepare a dataframe to categorize end stations over return traffic and fare_type
temp_df = category.groupby([category['traffic'], 
                            category['fare_type']]).count()['end_station_id'].reset_index(name='end_stations')
temp_df.head(10)
Out[257]:
     traffic  fare_type  end_stations
0   Very Low       Base            15
1   Very Low   Extended            27
2        Low       Base            50
3        Low   Extended           100
4     Normal       Base           103
5     Normal   Extended           107
6       High       Base            89
7       High   Extended            36
8  Very High       Base            20
9  Very High   Extended             2
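
Raw counts are hard to compare between the two fare types, so it can help to express each traffic level as a share of all stations of that fare type. A sketch using the counts from Out[257], rebuilt here so that it runs on its own:

```python
import pandas as pd

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# counts copied from Out[257]
temp_df = pd.DataFrame({
    'traffic': [level for level in level_order for _ in range(2)],
    'fare_type': ['Base', 'Extended'] * 5,
    'end_stations': [15, 27, 50, 100, 103, 107, 89, 36, 20, 2],
})

wide = (temp_df.pivot(index='traffic', columns='fare_type', values='end_stations')
               .reindex(level_order))

# divide every column by its own total, so each fare type sums to 100 %
share = wide.div(wide.sum(axis=0), axis=1).mul(100).round(1)
print(share)
```

Each column of `share` reads as a distribution, so the Base and Extended profiles can be compared level by level even if their totals differ.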

Data Dashboard:

Plot the distribution of End station traffic based on fare type:

In [303]:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string, 
                         ha = alignments[i], color = 'black', fontsize = 13, 
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
            

# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')


# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])

# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'fare_type', hue = 'traffic', alpha = 0.8, saturation = 0.8)

# improve plot aesthetics
plt.title('Distribution of End stations return traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.85, 0.8), loc = 6, labelspacing=0.5,  
           title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

separators = [0.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])

# Assign palette as per requirement
flatui = ['#cc2e88', '#fa98d0']
sb.set_palette(flatui, desat = 0.8)

# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'fare_type', alpha = 0.8, saturation = 0.8)

# improve plot aesthetics
plt.title('Classification of End stations by return traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Fare type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////


# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])

# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)

# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'fare_type', y = 'end_stations', hue = 'traffic');

# improve plot aesthetics
plt.title('Distribution of End stations return traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)

# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('fare_type', 'traffic', 'end_stations', x_annotations, y_annotations, alignments)

# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))

sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////


# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])

# Assign palette as per requirement
flatui = ['#cc2e88', '#fa98d0']
sb.set_palette(flatui, desat = 0.8)

# Seaborn's pointplot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'fare_type');

# improve plot aesthetics
plt.title('Classification of End stations by return traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)

# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')        
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,  
           title='Fare type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'fare_type', 'end_stations', x_annotations, y_annotations, alignments)

sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.7)

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.6 End stations return traffic categorized by fare type.png', dpi=300, bbox_inches='tight')

Observations:

  • For bike returns charged Extended Fares, many end stations show Low return traffic and few show High return traffic. This suggests that customers tend to avoid Extended Fares.
  • For bike returns charged Base Fares, many end stations show High return traffic and few show Low return traffic. This suggests that customers prefer Base Fares.

Reform:

  • Customers should be encouraged to ride the bikes for longer durations so that more rides incur Extended Fares, generating more income for the company.

----------------------------------------------

3.13.7 Insights:

  1. The bike return traffic is not equally distributed over the end stations. There are end stations with Very Low bike return activity. However, this does not imply that these end stations should be eliminated, as they might still see good bike rental traffic and deliver acceptable business metrics.
  2. Many end stations receiving One Way trips experience Low bike return traffic. For Round Trips, the bike is rented and returned at the same station, so Low bike return traffic at a station also implies Low bike rental traffic at that station. For One Way trips, however, Low bike returns create a deficit in the availability of bikes at the specific station. Hence immediate action is required to redirect bikes from stations with High and Very High return traffic to stations with Low and Very Low One Way return traffic, to normalize the availability of bikes over all stations.
  3. A majority of the end stations receiving Round Trips experience Low bike return traffic.
  4. A number of end stations experience Low and Very Low bike return activity and are clustered closely together. This indicates that the bike type has no influence on the return activity at these stations, because bike returns are correlated with bike rentals. Hence no action related to bike type is required to increase the bike return activity at end stations with Low and Very Low return activity.
  5. Bike returns under the Annual pass type show a high number of end stations with Low and Very Low return traffic. This is because the Annual pass also has a high number of stations with Low and Very Low rental activity.
  6. There are many end stations with relatively Low bike rental activity under the One Day pass.
  7. For bike returns charged Extended Fares, many end stations show Low return traffic and few show High return traffic. This suggests that customers tend to avoid Extended Fares.
  8. For bike returns charged Base Fares, many end stations show High return traffic and few show Low return traffic. This suggests that customers prefer Base Fares.
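
The deficit described in insight 2 can be made concrete by comparing returns and rentals per station. The following is a minimal sketch on toy data, not the notebook's own method; the column names `start_station` and `end_station` are hypothetical placeholders for the real station identifiers.

```python
import pandas as pd

# Toy one-way trips; `start_station` and `end_station` are hypothetical
# column names used only for illustration.
trips = pd.DataFrame({
    'start_station': ['A', 'A', 'A', 'B', 'C', 'C'],
    'end_station':   ['B', 'B', 'C', 'C', 'B', 'A'],
})

rentals = trips['start_station'].value_counts()
returns = trips['end_station'].value_counts()

# net flow = returns - rentals; a negative value means the station
# loses bikes over time and is a candidate for rebalancing
net_flow = returns.sub(rentals, fill_value=0).astype(int).sort_values()
print(net_flow)
```

Stations at the top of the sorted result are the ones that would run short of bikes, while those at the bottom accumulate a surplus that could be redirected.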

3.13.8 Reforms proposed:

  1. Promotions should be announced to encourage customers to opt for Round Trips at the end stations with Low bike return traffic.
  2. Smart bikes have fewer end stations with Normal and High bike return traffic than other bike types. This is because Smart bikes have fewer start stations with Normal and High bike rental traffic than other bike types. However, some stations might receive fewer Smart bike returns than rentals, which results in a Smart bike deficiency at these particular stations. Hence Smart bike returns should be adjusted/redirected to match the bike demand at the specific end stations.
  3. As there are many end stations with relatively Low bike rental activity under the One Day pass, bike returns from other stations should be redirected to match the demand for bikes at these end stations.
  4. Customers should be encouraged to ride the bikes for longer durations so that more rides incur Extended Fares, generating more income for the company.

3.14 Is there a requirement for launching a reminder to notify customers of the expiration of the base fare?

  • Column: fare_type
  • Data type: categorical, ordinal
  • Plot : Histogram, Bar chart

Display the top 5 most frequent trip durations for the extended fares:

In [259]:
# limit the dataset to the extended fare type
extended_df = bikeshare.query(' fare_type == "Extended" ')
# obtain the most frequent trip durations for trips with extended fares
freq_minutes = extended_df.duration_min.value_counts().head(5).index

print('Most frequent extended trip durations:')
print('-'*38)
for i, minute in enumerate(freq_minutes):
    print('{}. {} minutes'.format(i+1, minute))
Most frequent extended trip durations:
--------------------------------------
1. 31 minutes
2. 32 minutes
3. 33 minutes
4. 34 minutes
5. 35 minutes

It appears that the most frequent extended-fare trip durations fall within 5 minutes of the Base Fare threshold. Plot the distribution of extended rides with a 5 minute grace period for further analysis. To limit the effect of outliers, restrict the dataset to trip durations of at most 120 minutes.

In [260]:
# calculate the percentage of the dataset that falls under `2 hour` trip duration.
df_percent = np.round((bikeshare.query(' duration_min <= 120 ').shape[0]/bikeshare.shape[0])*100, 2)
print('The percentage of the dataset that falls under 2 hour trip duration: {} %'.format(df_percent))
The percentage of the dataset that falls under 2 hour trip duration: 96.9 %
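
As a side note, the same kind of percentage can be obtained without dividing two `shape[0]` values, because the mean of a boolean mask is the fraction of `True` entries. A minimal sketch on toy durations (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy trip durations in minutes (illustrative values only)
duration_min = pd.Series([5, 12, 31, 33, 45, 90, 130, 600])

# The mean of a boolean mask equals the share of rows where it is True
share_under_120 = round((duration_min <= 120).mean() * 100, 2)
print('Share of trips at or under 120 minutes: {} %'.format(share_under_120))
```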

3.14.1 Plot the distribution of grace period rides over other extended rides:

In [261]:
# Assign color palette and grid as per requirement
sb.set_style('white')
flatui = ["gold"]
sb.set_palette(palette = flatui, n_colors = 10, desat = 0.7)

# prepare the data for the plot
# -------------------------------------------------------
# Limit the dataset that has entries under 2 hours duration
duration_lim_120 = bikeshare.query(' duration_min <= 120 and duration_min > 30')

# Limit the dataset that has entries under 35 minutes duration
duration_lim_35 = bikeshare.query(' duration_min <= 35 and duration_min > 30')

base_color = sb.color_palette()[0]
bin_edges = np.arange(30, duration_lim_120.duration_min.max()+1, 1)
x_locs = np.arange(30, 120+10, 10)
# -------------------------------------------------------

ax1 = plt.hist(duration_lim_120['duration_min'], color = base_color, bins = bin_edges)
ax2 = plt.hist(duration_lim_35['duration_min'], color = 'c', bins = bin_edges)

# improve plot aesthetics
plt.title('Distribution of extended trip durations under 2 hours\n',  weight = 'bold', fontsize = 16)
plt.xlabel('\nDuration (minutes)', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(x_locs, x_locs, fontsize = 12)

# convert the y_ticks into units of thousands
locs, labels = plt.yticks()
y_tick_locs = np.arange(0, int(math.ceil(max(locs)))+1000, 1000)
y_tick_names = ['{:0.0f} K'.format(loc/1000) for loc in y_tick_locs]
plt.yticks(y_tick_locs, y_tick_names, fontsize=12)

# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], color = 'c', linestyle='-', linewidth = 2),
          Line2D([], [], color = base_color, linestyle='-', linewidth = 2)]

plt.legend(custom, ['Grace', 'Extended'], scatterpoints=1, frameon=True, fancybox=True, 
           shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5, 
           ncol = 1, title='Duration period', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.15, 1));
# -------------------------------------------------------

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.14.1 Distribution of extended trip durations under 2 hours.png', dpi=300, bbox_inches='tight')

Observation:

  • It appears that customers most frequently return the bikes just after the Base Fare margin. Even if this is the result of individual preference or behaviour, it might give rise to customer dissatisfaction at paying an extra fee (Extended Fare) that could have been avoided with a little caution.
  • This suggests that there is a requirement for launching a reminder to notify customers of the expiration of the Base Fare.

3.14.2 Calculation of the percentage of extended-fare rides that are eligible for a 5 minute grace period:

In [262]:
# limit the dataset to the extended fare type
extended_df = bikeshare.query(' fare_type == "Extended" ')

grace_percent = math.ceil(extended_df.query(' duration_min > 30 and duration_min <= 35 ').shape[0]/extended_df.shape[0]*100)
print('Percentage of customers who are eligible for grace period: {} %'.format(grace_percent))
Percentage of customers who are eligible for grace period: 16 %
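
The split into Base, Grace and Extended bands can also be expressed with `pd.cut`, which labels every ride in one pass. This is a minimal sketch on toy data, assuming the 30 minute Base Fare threshold and the 5 minute grace window used above; the `fare_band` column is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy trip durations in minutes (illustrative values only)
toy = pd.DataFrame({'duration_min': [10, 25, 30, 31, 33, 35, 36, 50, 75, 120]})

# Right-closed bins: (0, 30] Base, (30, 35] Grace, (35, inf) Extended
bins = [0, 30, 35, np.inf]
labels = ['Base', 'Grace', 'Extended']
toy['fare_band'] = pd.cut(toy['duration_min'], bins=bins, labels=labels)

# Share of rides beyond the base fare that fall inside the grace window
beyond_base = toy[toy['duration_min'] > 30]
grace_share = (beyond_base['fare_band'] == 'Grace').mean() * 100
print('Grace share among rides over 30 minutes: {:.2f} %'.format(grace_share))
```

Because `pd.cut` uses right-closed intervals by default, a ride of exactly 35 minutes lands in the Grace band, matching the `duration_min <= 35` condition in the cell above.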

Plot the percentage of customer rides that are eligible for 5 minute grace period:

In [263]:
sb.set_palette('deep', n_colors=2, desat=0.6)

# prepare the data for the plot
extended_df = bikeshare.query('duration_min > 30')
counts = (extended_df.query(' duration_min <= 35 ')['trip_id'].count(), 
          extended_df.query(' duration_min > 35 ')['trip_id'].count())
order = ['Eligible', 'Not Eligible']

# seaborn's bar plot
sb.barplot(x = order, y = counts, alpha= 1, saturation = 0.8)

# improve plot aesthetics
plt.title('Customer rides eligible for 5 minute grace period\n',  weight = 'bold', fontsize = 16)
plt.xlabel('\nCustomer eligibility', fontsize = 14)
plt.ylabel('Number of rides (thousands)\n', fontsize = 14)

# loop through yticks to convert them into units of thousands
locs, labels = plt.yticks()
new_labels = ['{:0.0f} k'.format(loc/1000) for loc in locs]
plt.yticks(locs, new_labels, fontsize = 12)
plt.xticks(fontsize = 12)

# add annotations
# -------------------------------------------------------
total_counts = sum(counts)

# get the current tick locations and labels
locs, labels = plt.xticks()

# loop through each pair of locations and labels
for i, loc in enumerate(locs):
    count = counts[i]
    pct_string = '{:0.0f}%'.format(100*count/total_counts)
    # print the annotation
    plt.text(loc, count + (total_counts/20), pct_string, ha = 'center', color = 'black', fontsize = 14)
# -------------------------------------------------------

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.14.2 Customer rides eligible for 5 minute grace period.png', dpi=300, bbox_inches='tight')

Observation:

  • The above plot shows that 16% of the rides longer than 30 minutes are eligible for a 5 minute grace period before an Extended Fare is charged.
  • Since the rides eligible for the grace period are a small share of all rides with Extended Fares, the income generated from Extended Fares will see only a small decrement. Hence deployment of a grace period is a fair option with little impact on the income generated from Extended Fares.

----------------------------------------------

3.14.3 Insights:

  1. It appears that customers most frequently return the bikes just after the Base Fare margin. Even if this is the result of individual preference or behaviour, it might give rise to customer dissatisfaction at paying an extra fee (Extended Fare) that could have been avoided with a little caution. This suggests that there is a requirement for launching a reminder to notify customers of the expiration of the Base Fare.
  2. It appears that 16% of the rides longer than 30 minutes are eligible for a 5 minute grace period before an Extended Fare is charged.

3.14.4 Reforms proposed:

  1. Launching a reminder by mobile notification or other channels to notify customers of the expiration of the Base Fare will alert them to return the bike to the nearest bike station and avoid the Extended Fare, resulting in increased customer satisfaction.
  2. The alternative option is to give a 5 minute grace period on Extended Fares. Since the rides eligible for a 5 minute grace period are a small share of all rides with Extended Fares, the income generated from Extended Fares will see only a small decrement. Hence deployment of a grace period is a fair option with little impact on the income generated from Extended Fares.

3.15 What is the average trip duration and the most frequent trip duration among the bike rentals? Explore the factors that influence the rental duration.

  • Column: duration_min
  • Data type: numerical, continuous
  • Plot : Bar chart, Pie plot

3.15.1 Categorical distribution of trip durations:

In [264]:
# compute the descriptive statistics of trip durations
bikeshare.duration_min.describe()
Out[264]:
count    808589.000000
mean         29.861795
std         119.355799
min           0.000000
25%           6.000000
50%          12.000000
75%          23.000000
max        9283.000000
Name: duration_min, dtype: float64

Trip durations under 1 minute probably result from the bicycle being returned immediately after the rental due to a technical or other issue. Hence exclude the trips under 1 minute.

Break down the trip durations into categories and convert into a dataframe:

In [265]:
durations = {'trip_type' : pd.Series(['Small', 'Normal', 'Long', 'Very Long']), 
             'trip_count' : pd.Series([bikeshare.query(' duration_min >= 1 and duration_min < 10 ').shape[0], 
                                       bikeshare.query(' duration_min >= 10 and duration_min < 100 ').shape[0],
                                       bikeshare.query(' duration_min >= 100 and duration_min < 1000 ').shape[0], 
                                       bikeshare.query(' duration_min >= 1000 ').shape[0]])}

# create a Dataframe. 
trip_durations = pd.DataFrame(durations)
trip_durations
Out[265]:
trip_type trip_count
0 Small 336587
1 Normal 438460
2 Long 30467
3 Very Long 2436
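
The same categories can be derived in a single pass with `pd.cut` instead of four separate queries. The sketch below is an alternative illustration on toy data, not the approach used in the cell above; `right=False` makes the intervals left-closed so that they match the `>=` / `<` conditions.

```python
import numpy as np
import pandas as pd

# Toy durations in minutes chosen to sit on the category boundaries
durations = pd.Series([0, 1, 9, 10, 99, 100, 999, 1000, 5000], name='duration_min')

# Left-closed bins: [1, 10), [10, 100), [100, 1000), [1000, inf)
edges = [1, 10, 100, 1000, np.inf]
names = ['Small', 'Normal', 'Long', 'Very Long']
trip_type = pd.cut(durations, bins=edges, labels=names, right=False)

# Durations under 1 minute fall outside every bin and become NaN,
# so value_counts() excludes them automatically
counts = trip_type.value_counts().reindex(names)
print(counts)
```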

Plot the categorical distribution of trip durations:

Bar chart:

In [266]:
# Assign grid and color palette as per requirement
plt.figure(figsize = [12, 4])
sb.set_style("white")
base_color = 'cadetblue'

# plot pre-calculations
duration_order = ['Very Long', 'Long', 'Normal', 'Small']
time_order = ['[1000, )', '[100, 1000)', '[10, 100)', '[1 , 10)']
trip_counts = trip_durations.trip_count
trip_order = trip_durations.trip_type
x_tick_values = np.arange(0, trip_counts.max() + 50000, 50000)
x_tick_names = ['{:0.0f} K'.format(v/1000) for v in x_tick_values]
y_tick_values = np.arange(0, len(duration_order), 1)  # one tick per duration category
y_tick_names = duration_order
clrs = ['indianred', 'indianred', 'cadetblue', 'cadetblue']

# bar plot
sb.barplot(x = trip_counts, y = trip_order, order = duration_order, palette=clrs, alpha= 1, saturation = 1)

# plot - visual enhancements
plt.title('Categorical distribution of trip durations\n', weight = 'bold', fontsize = 16)
plt.xticks(x_tick_values, x_tick_names, fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
plt.xlabel('\nNumber of rides (thousands)', fontsize = 14)
plt.ylabel('Duration type\n', fontsize = 14)

# Create a custom legend:
# -------------------------------------------------------
# Plot empty lists with the desired label
indents = [10, 13, 11, 13]
for duration, time, indent in zip(duration_order, time_order, indents):
    plt.scatter([], [], c='k', alpha=0.3,
                label= '{}'.format(duration).ljust(indent, ' ') + ' - ' + '{}'.format(time))
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=True, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.5), loc = 6, labelspacing=0.5,  
           title='Duration - minutes', title_fontsize=14, fontsize=12, facecolor='white', 
           markerfirst=True, handlelength=0.5, handletextpad=0.5)
# -------------------------------------------------------

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.1 Categorical distribution of trip durations.png', dpi=300, bbox_inches='tight')

Observation:

  • It appears that most customers prefer rides with either Normal or Small trip durations and avoid trips with Long and Very Long durations.

Reform:

  • Even if this behaviour is to be expected, encouraging customers to ride the bikes for longer durations will generate income and improve business metrics.
  • Organizing 2 hour bike rallies and other events will attract enthusiasts to ride the bikes for longer durations.
  • Announcing Low Fares for tourists will attract them to rent the bikes for longer durations.

3.15.2 Calculate the most frequent and average trip durations:

The statistical analysis performed on the trip durations is affected by the presence of outliers. Hence a separate analysis is performed on the dataset limited to trip durations under 30 minutes and under 120 minutes, along with the overall trip durations.

Calculate the average trip durations of the dataset under timeline limitations:

In [267]:
# calculate average trip durations of the dataset under timeline limitations:
overall_mean = math.ceil(bikeshare.duration_min.mean())
duration_lim_120_mean = math.ceil(bikeshare.query(' duration_min <= 120 ').duration_min.mean())
duration_lim_30_mean = math.ceil(bikeshare.query(' duration_min <= 30 ').duration_min.mean())
print('Dataset limitation'.ljust(20, ' '), ':', 'Avg. Trip duration')
print('-'*41)
print('overall'.ljust(20, ' '), ':', overall_mean, 'minutes')
print('under 120 minutes'.ljust(20, ' '), ':', duration_lim_120_mean, 'minutes')
print('under 30 minutes'.ljust(20, ' '), ':', duration_lim_30_mean, 'minutes')
Dataset limitation   : Avg. Trip duration
-----------------------------------------
overall              : 30 minutes
under 120 minutes    : 18 minutes
under 30 minutes     : 12 minutes

Calculate the most frequent trip durations of the dataset under timeline limitations:

In [268]:
# calculate most frequent trip durations of various timeline limitations
# Series.mode() returns a Series (ties are possible), so take its first value
overall_mode = math.ceil(bikeshare.duration_min.mode()[0])
duration_lim_120_mode = math.ceil(bikeshare.query(' duration_min <= 120 ').duration_min.mode()[0])
duration_lim_30_mode = math.ceil(bikeshare.query(' duration_min <= 30 ').duration_min.mode()[0])
print('Dataset limitation'.ljust(20, ' '), ':', 'Freq. Trip duration')
print('-'*42)
print('overall'.ljust(20, ' '), ':', overall_mode, 'minutes')
print('under 120 minutes'.ljust(20, ' '), ':', duration_lim_120_mode, 'minutes')
print('under 30 minutes'.ljust(20, ' '), ':', duration_lim_30_mode, 'minutes')
Dataset limitation   : Freq. Trip duration
------------------------------------------
overall              : 6 minutes
under 120 minutes    : 6 minutes
under 30 minutes     : 6 minutes

Convert the most frequent and average trip durations into a dataframe:

In [269]:
# convert the most frequent and average trip durations into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset_duration'] = ['< 30', '< 120', 'overall']
duration_df['avg_trip_duration'] = [12, 18, 30]
duration_df['freq_trip_duration'] = [6, 6, 6]
duration_df
Out[269]:
dataset_duration avg_trip_duration freq_trip_duration
0 < 30 12 6
1 < 120 18 6
2 overall 30 6

Plot the most frequent and average trip durations:

In [270]:
plt.figure(figsize = [12, 5])

# left plot: Average trip duration
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 1)
sb.set_style('white')
sb.set_palette(palette = "GnBu", n_colors = 3, desat = None)
# Seaborn's bar chart
ax1 = sb.barplot(data = duration_df, x = 'dataset_duration', y = 'avg_trip_duration')
# improve plot aesthetics
plt.title('Avg. Trip duration\n',  weight = 'bold', fontsize = 14, color = 'dimgrey')
plt.ylabel('', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

# convert the yticks into integer values
locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# add annotations
# -------------------------------------------------------
locs, labels = plt.xticks()
duration_avg_counts = duration_df.avg_trip_duration.values
duration_avg_max = duration_avg_counts.max()
clrs = ['gold' if (value > ((duration_avg_max*4)/5)) else 'limegreen' for value in duration_avg_counts]

# loop through each pair of locations
for loc, duration_avg_count, clr in zip(locs, duration_avg_counts, clrs):
    try:
        count = duration_avg_count
    except KeyError:
        count = 0   
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # print the annotation depending on the bar length
    plt.text(loc, count + int(duration_avg_max/30), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, right=True, bottom=False, left=False);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: Most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 2)
sb.set_style('white')
sb.set_palette(palette = "GnBu", n_colors = 3, desat = None)
# seaborn's bar plot
ax2 = sb.barplot(data = duration_df, x = 'dataset_duration', y = 'freq_trip_duration')
# improve plot aesthetics
plt.title('Most freq Trip duration\n',  weight = 'bold', fontsize = 14, color = 'dimgrey')
plt.ylabel('', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# add annotations
# -------------------------------------------------------
locs, labels = plt.xticks()
duration_freq_counts = duration_df.freq_trip_duration.values
duration_freq_max = duration_freq_counts.max()
clrs = ['gold' if (value > ((duration_freq_max*4)/5)) else 'limegreen' for value in duration_freq_counts]

# loop through each pair of locations
for loc, duration_freq_count, clr in zip(locs, duration_freq_counts, clrs):
    try:
        count = duration_freq_count
    except KeyError:
        count = 0   
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # print the annotation depending on the bar length
    plt.text(loc, count + (duration_freq_max/20), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, right=True, bottom=False, left=False);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Trip durations of the dataset under timeline limitations\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.2 Trip durations of the dataset under timeline limitations.png', dpi=300, bbox_inches='tight')

Observations:

  • The above plot shows that the average trip duration of the overall dataset is 30 minutes. However, when the outliers are removed by limiting the trips to 120 minutes, the average trip duration drops to 18 minutes.
  • For trips under 30 minutes, the average trip duration is 12 minutes.
  • The most frequent trip duration remains 6 minutes for all datasets limited by trip duration. This indicates that most customers rent the bikes for a quick short trip.
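
Why the mean moves with each cap while the mode stays put can be seen on a small example. The following sketch uses made-up durations, with one very long trip standing in for the outliers; it only illustrates the effect and does not reproduce the figures above.

```python
import pandas as pd

# Toy durations in minutes (illustrative values, not from the dataset);
# the single 9000-minute trip plays the role of an outlier.
durations = pd.Series([6, 6, 6, 8, 12, 15, 20, 25, 40, 9000])

caps = {'overall': None, 'under 120': 120, 'under 30': 30}
summary = {}
for name, cap in caps.items():
    subset = durations if cap is None else durations[durations <= cap]
    # the mean is pulled up by extreme values; the mode only tracks
    # the most common value and ignores the tail
    summary[name] = (round(subset.mean(), 2), int(subset.mode()[0]))

for name, (mean, mode) in summary.items():
    print(name.ljust(12), ':', 'mean', mean, '| mode', mode)
```

A single extreme value is enough to inflate the overall mean far beyond a typical trip, which is the reason for reporting the capped subsets alongside the overall figure.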

Deeper exploration of the factors that influence the trip durations for hidden insights:

  1. Trip durations equal to or under 1 minute probably result from the bike being returned immediately after the rental due to a technical or other issue. Hence trips of 1 minute or less are excluded throughout the rest of the analysis on trip durations.
  2. The statistical analysis performed on the trip durations is affected by the presence of outliers. Hence a separate analysis is performed on the dataset limited to trip durations under 30 minutes and under 120 minutes, along with the overall trip durations.

3.15.3 Calculate the most frequent and average trip durations by trip type:

Calculate the average trip duration and the most frequent trip duration subjected to each trip type:

In [271]:
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')

oneway_mean = math.ceil(duration_lim.query(' trip_type == "One Way" ').duration_min.mean())
oneway_mode = duration_lim.query(' trip_type == "One Way" ').duration_min.mode()[0]

roundtrip_mean = math.ceil(duration_lim.query(' trip_type == "Round Trip" ').duration_min.mean())
roundtrip_mode = duration_lim.query(' trip_type == "Round Trip" ').duration_min.mode()[0]

print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('oneway_mean    : ', oneway_mean, 'minutes')
print('roundtrip_mean : ', roundtrip_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('oneway_mode    : ', oneway_mode, 'minutes')
print('roundtrip_mode : ', roundtrip_mode, 'minutes')
Overall Dataset excluding 1 min
===============================

-------Duration mean--------
oneway_mean    :  24 minutes
roundtrip_mean :  71 minutes


-------Duration mode--------
oneway_mode    :  5 minutes
roundtrip_mode :  28 minutes
In [272]:
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')

oneway_mean = math.ceil(duration_lim_120.query(' trip_type == "One Way" ').duration_min.mean())
oneway_mode = duration_lim_120.query(' trip_type == "One Way" ').duration_min.mode()[0]

roundtrip_mean = math.ceil(duration_lim_120.query(' trip_type == "Round Trip" ').duration_min.mean())
roundtrip_mode = duration_lim_120.query(' trip_type == "Round Trip" ').duration_min.mode()[0]

print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('oneway_mean    : ', oneway_mean, 'minutes')
print('roundtrip_mean : ', roundtrip_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('oneway_mode    : ', oneway_mode, 'minutes')
print('roundtrip_mode : ', roundtrip_mode, 'minutes')
Dataset limited under 120 min excluding 1 min
=============================================

-------Duration mean--------
oneway_mean    :  16 minutes
roundtrip_mean :  39 minutes


-------Duration mode--------
oneway_mode    :  5 minutes
roundtrip_mode :  28 minutes
In [273]:
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')

oneway_mean = math.ceil(duration_lim_30.query(' trip_type == "One Way" ').duration_min.mean())
oneway_mode = duration_lim_30.query(' trip_type == "One Way" ').duration_min.mode()[0]

roundtrip_mean = math.ceil(duration_lim_30.query(' trip_type == "Round Trip" ').duration_min.mean())
roundtrip_mode = duration_lim_30.query(' trip_type == "Round Trip" ').duration_min.mode()[0]

print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('oneway_mean    : ', oneway_mean, 'minutes')
print('roundtrip_mean : ', roundtrip_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('oneway_mode    : ', oneway_mode, 'minutes')
print('roundtrip_mode : ', roundtrip_mode, 'minutes')
Dataset limited under 30 min excluding 1 min
============================================

-------Duration mean--------
oneway_mean    :  12 minutes
roundtrip_mean :  18 minutes


-------Duration mode--------
oneway_mode    :  5 minutes
roundtrip_mode :  28 minutes

Convert the most frequent and average trip durations categorized by trip type into a dataframe:

In [274]:
# convert the most frequent and average trip durations categorized by trip type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30', '< 120', '< 120', 'overall', 'overall']
duration_df['trip_type'] = ['One Way', 'Round Trip', 'One Way', 'Round Trip', 'One Way', 'Round Trip']
duration_df['duration_avg'] = [12, 18, 16, 39, 24, 71]
duration_df['duration_mode'] = [5, 28, 5, 28, 5, 28]
duration_df
Out[274]:
dataset trip_type duration_avg duration_mode
0 < 30 One Way 12 5
1 < 30 Round Trip 18 28
2 < 120 One Way 16 5
3 < 120 Round Trip 39 28
4 overall One Way 24 5
5 overall Round Trip 71 28
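
The table above is typed in by hand from the printed results. As an alternative, the sketch below derives the same kind of table with `groupby`, so the numbers cannot drift from the data. The helper name `summarize_durations` and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import math
import pandas as pd

def summarize_durations(df, group_col, filters):
    """Return one row per (dataset, group): ceiling of the mean and the first mode."""
    rows = []
    for label, expr in filters.items():
        subset = df.query(expr)
        # groups with no rows in this subset are simply not yielded
        for group, durations in subset.groupby(group_col)['duration_min']:
            rows.append({
                'dataset': label,
                group_col: group,
                'duration_avg': math.ceil(durations.mean()),
                'duration_mode': durations.mode().iloc[0],
            })
    return pd.DataFrame(rows)

# Hypothetical toy data standing in for `bikeshare`
demo = pd.DataFrame({
    'trip_type': ['One Way'] * 4 + ['Round Trip'] * 4,
    'duration_min': [5, 5, 12, 200, 28, 28, 40, 400],
})
filters = {
    '< 30': 'duration_min > 1 and duration_min <= 30',
    'overall': 'duration_min > 1',
}
summary = summarize_durations(demo, 'trip_type', filters)
print(summary)
```

The same helper could be called with `'bike_type'`, `'pass_type'` or `'fare_type'` as `group_col`.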

Plot the most frequent and average trip durations by trip type:

In [275]:
plt.figure(figsize = [12, 6])
flatui = ['deepskyblue', 'sandybrown']
sb.set_palette(flatui, n_colors=2, desat=0.6)


# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)

ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'trip_type')

# improve plot aesthetics
plt.title('Avg. Trip durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
avg_rental_counts = duration_df["duration_avg"]
avg_rental_types = duration_df["trip_type"]
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (trip == "Round Trip") else 'limegreen' for trip in avg_rental_types ]

# loop through each pair of locations and assign text
for loc, avg_rental_count, clr in zip(locs, avg_rental_counts, clrs):
    count = avg_rental_count
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # place the label slightly above and to the left of the point
    plt.text(loc-0.2, count + int(avg_rental_max/20), pct_string, ha = 'center', color = 'black', fontsize = 12,
             bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# -------------------------------------------------------

plt.legend('', frameon=False, fancybox=False)

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

ax2 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'trip_type')

# improve plot aesthetics
plt.title('Most frequent durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
freq_rental_counts = duration_df["duration_mode"]
freq_rental_types = duration_df["trip_type"]
freq_rental_max = freq_rental_counts.max()
clrs = ['gold' if (trip == "Round Trip") else 'limegreen' for trip in freq_rental_types ]

# loop through each pair of locations and assign text
for loc, freq_rental_count, clr in zip(locs, freq_rental_counts, clrs):
    count = freq_rental_count
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # place the label above the point
    plt.text(loc, count + int(freq_rental_max/5), pct_string, ha = 'center', color = 'black', fontsize = 12,
             bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# -------------------------------------------------------

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 2,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(-0.5, 1.5))

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.65)
plt.suptitle('Assessment of trip durations based on trip type over datasets\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.3 Assessment of trip durations based on trip type over datasets.png', dpi=300, bbox_inches='tight')

Observations:

  • The above plot shows that outliers have a strong effect on the average trip durations. Over the overall dataset the average is 24 minutes for One Way trips and 71 minutes for Round Trips; when the dataset is limited under 120 minutes, it drops to 16 minutes for One Way trips and 39 minutes for Round Trips.
  • The dataset limited under 30 minutes has an average trip duration of 12 minutes for One Way trips and 18 minutes for Round Trips.
  • In every dataset, customers therefore tend to ride longer on Round Trips than on One Way trips.
  • The most frequent trip duration for Round Trips stays at 28 minutes in every dataset. This suggests that customers who prefer longer trips use close to the full extent of the Base Fare trip duration on Round Trips.
  • The most frequent trip duration for One Way trips stays at 5 minutes in every dataset. This suggests that most customers prefer short rides on One Way trips.
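
These observations rest on the mean being sensitive to outliers while the mode is not. A minimal sketch on made-up durations (not taken from the dataset) shows the effect:

```python
import pandas as pd

# Hypothetical trip durations in minutes, with two long outliers (90 and 600)
durations = pd.Series([5, 5, 5, 8, 12, 28, 90, 600])

stats = {}
for label, cap in [('< 30', 30), ('< 120', 120), ('overall', None)]:
    subset = durations if cap is None else durations[durations <= cap]
    # store (mean rounded to one decimal, most frequent value)
    stats[label] = (float(round(subset.mean(), 1)), int(subset.mode().iloc[0]))
    print(label, stats[label])
```

The mean grows every time the cap is relaxed, whereas the most frequent value is the same in all three subsets.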

3.15.4 Calculate the most frequent and average trip durations by bike type:

Calculate the average trip duration and the most frequent trip duration for each bike type:

Note: Since no useful conclusion or insight can be drawn for the unknown bike type, it is excluded from the analysis.
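
One way to make this exclusion explicit is to keep only the known bike types before computing any statistics. The sketch below uses made-up rows, and the label `"Unknown"` is an assumption about how the unknown bike type is stored.

```python
import pandas as pd

# Hypothetical sample; the "Unknown" label is an assumption
demo = pd.DataFrame({
    'bike_type': ['Standard', 'Electric', 'Smart', 'Unknown', 'Standard'],
    'duration_min': [5, 4, 7, 12, 9],
})

KNOWN_BIKE_TYPES = ['Standard', 'Electric', 'Smart']

# keep only rows whose bike type is one of the known types
known = demo[demo['bike_type'].isin(KNOWN_BIKE_TYPES)]
print(known['bike_type'].value_counts())
```

The cells below reach the same result by querying each known bike type individually.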

In [276]:
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')

standard_mean = math.ceil(duration_lim.query(' bike_type == "Standard" ').duration_min.mean())
standard_mode = duration_lim.query(' bike_type == "Standard" ').duration_min.mode()[0]

electric_mean = math.ceil(duration_lim.query(' bike_type == "Electric" ').duration_min.mean())
electric_mode = duration_lim.query(' bike_type == "Electric" ').duration_min.mode()[0]

smart_mean = math.ceil(duration_lim.query(' bike_type == "Smart" ').duration_min.mean())
smart_mode = duration_lim.query(' bike_type == "Smart" ').duration_min.mode()[0]

print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('standard_mean : ', standard_mean, 'minutes')
print('electric_mean : ', electric_mean, 'minutes')
print('smart_mean    : ', smart_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('standard_mode : ', standard_mode, 'minutes')
print('electric_mode : ', electric_mode, 'minutes')
print('smart_mode    : ', smart_mode, 'minutes')
Overall Dataset excluding 1 min
===============================

-------Duration mean--------
standard_mean :  31 minutes
electric_mean :  25 minutes
smart_mean    :  45 minutes


-------Duration mode--------
standard_mode :  5 minutes
electric_mode :  4 minutes
smart_mode    :  7 minutes
In [277]:
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')

standard_mean = math.ceil(duration_lim_120.query(' bike_type == "Standard" ').duration_min.mean())
standard_mode = duration_lim_120.query(' bike_type == "Standard" ').duration_min.mode()[0]

electric_mean = math.ceil(duration_lim_120.query(' bike_type == "Electric" ').duration_min.mean())
electric_mode = duration_lim_120.query(' bike_type == "Electric" ').duration_min.mode()[0]

smart_mean = math.ceil(duration_lim_120.query(' bike_type == "Smart" ').duration_min.mean())
smart_mode = duration_lim_120.query(' bike_type == "Smart" ').duration_min.mode()[0]

print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('standard_mean : ', standard_mean, 'minutes')
print('electric_mean : ', electric_mean, 'minutes')
print('smart_mean    : ', smart_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('standard_mode : ', standard_mode, 'minutes')
print('electric_mode : ', electric_mode, 'minutes')
print('smart_mode    : ', smart_mode, 'minutes')
Dataset limited under 120 min excluding 1 min
=============================================

-------Duration mean--------
standard_mean :  17 minutes
electric_mean :  16 minutes
smart_mean    :  31 minutes


-------Duration mode--------
standard_mode :  5 minutes
electric_mode :  4 minutes
smart_mode    :  7 minutes
In [278]:
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')

standard_mean = math.ceil(duration_lim_30.query(' bike_type == "Standard" ').duration_min.mean())
standard_mode = duration_lim_30.query(' bike_type == "Standard" ').duration_min.mode()[0]

electric_mean = math.ceil(duration_lim_30.query(' bike_type == "Electric" ').duration_min.mean())
electric_mode = duration_lim_30.query(' bike_type == "Electric" ').duration_min.mode()[0]

smart_mean = math.ceil(duration_lim_30.query(' bike_type == "Smart" ').duration_min.mean())
smart_mode = duration_lim_30.query(' bike_type == "Smart" ').duration_min.mode()[0]

print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('standard_mean : ', standard_mean, 'minutes')
print('electric_mean : ', electric_mean, 'minutes')
print('smart_mean    : ', smart_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('standard_mode : ', standard_mode, 'minutes')
print('electric_mode : ', electric_mode, 'minutes')
print('smart_mode    : ', smart_mode, 'minutes')
Dataset limited under 30 min excluding 1 min
============================================

-------Duration mean--------
standard_mean :  11 minutes
electric_mean :  13 minutes
smart_mean    :  16 minutes


-------Duration mode--------
standard_mode :  5 minutes
electric_mode :  4 minutes
smart_mode    :  7 minutes

Convert the most frequent and average trip durations categorized by bike type into a dataframe:

In [279]:
# convert the most frequent and average trip durations categorized by bike type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30', '< 30', 
                          '< 120', '< 120', '< 120',
                          'overall', 'overall', 'overall']

duration_df['bike_type'] = ['Standard', 'Electric', 'Smart', 
                            'Standard', 'Electric', 'Smart',
                            'Standard', 'Electric', 'Smart']

duration_df['duration_avg'] = [11, 13, 16,
                               17, 16, 31,
                               31, 25, 45]

duration_df['duration_mode'] = [5, 4, 7,
                                5, 4, 7,
                                5, 4, 7]
duration_df
Out[279]:
dataset bike_type duration_avg duration_mode
0 < 30 Standard 11 5
1 < 30 Electric 13 4
2 < 30 Smart 16 7
3 < 120 Standard 17 5
4 < 120 Electric 16 4
5 < 120 Smart 31 7
6 overall Standard 31 5
7 overall Electric 25 4
8 overall Smart 45 7

Plot the most frequent and average trip durations by bike type:

In [280]:
plt.figure(figsize = [12, 5])
flatui = ['#ff91e2',  '#60acfc', '#bd91ff']
sb.set_palette(flatui, n_colors=3, desat=0.8)

# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)

ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'bike_type', alpha = 1)

# improve plot aesthetics
plt.title('Avg. Trip durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# plot empty legend
plt.legend('', frameon=False, fancybox=False)
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

ax2 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'bike_type', alpha = 1)

# improve plot aesthetics
plt.title('Most frequent durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Assessment of trip durations based on bike type over datasets\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.4 Assessment of trip durations based on bike type over datasets.png', dpi=300, bbox_inches='tight')

Observations:

  • The above plot shows that Smart bikes have the highest average trip duration in every dataset, which suggests that customers prefer Smart bikes for longer trips.
  • Standard and Electric bikes have similar average trip durations, which suggests that customers use these two bike types for trips of similar length.
  • The gap between Smart bikes and the other bike types widens once longer trips are included: Smart bikes average 16 minutes in the dataset limited under 30 minutes, but 31 minutes in the dataset limited under 120 minutes, where Standard and Electric bikes average only 17 and 16 minutes.
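
The gap between Smart bikes and the other bike types can be checked directly from the averages tabulated above. The sketch below re-enters those nine values and subtracts the larger of the Standard and Electric averages from the Smart average; the variable names are chosen for illustration only.

```python
import pandas as pd

# Average durations copied from the bike type table above
duration_avg = pd.DataFrame({
    'dataset': ['< 30'] * 3 + ['< 120'] * 3 + ['overall'] * 3,
    'bike_type': ['Standard', 'Electric', 'Smart'] * 3,
    'duration_avg': [11, 13, 16, 17, 16, 31, 31, 25, 45],
})

# one row per dataset, one column per bike type
wide = duration_avg.pivot(index='dataset', columns='bike_type', values='duration_avg')
wide = wide.loc[['< 30', '< 120', 'overall']]

# Smart average minus the larger of the other two averages
gap = wide['Smart'] - wide[['Standard', 'Electric']].max(axis=1)
print(gap)
```

With these values the gap is 3 minutes under 30 minutes and 14 minutes both under 120 minutes and overall.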

3.15.5 Calculate the most frequent and average trip durations by pass type:

Calculate the average trip duration and the most frequent trip duration for each pass type:

Note: Since the Flex pass was introduced to employees for testing purposes, it is excluded from the analysis.

In [281]:
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')

walkup_mean = math.ceil(duration_lim.query(' pass_type == "Walk-up" ').duration_min.mean())
walkup_mode = duration_lim.query(' pass_type == "Walk-up" ').duration_min.mode()[0]

oneday_mean = math.ceil(duration_lim.query(' pass_type == "One Day" ').duration_min.mean())
oneday_mode = duration_lim.query(' pass_type == "One Day" ').duration_min.mode()[0]

monthly_mean = math.ceil(duration_lim.query(' pass_type == "Monthly" ').duration_min.mean())
monthly_mode = duration_lim.query(' pass_type == "Monthly" ').duration_min.mode()[0]

annual_mean = math.ceil(duration_lim.query(' pass_type == "Annual" ').duration_min.mean())
annual_mode = duration_lim.query(' pass_type == "Annual" ').duration_min.mode()[0]

print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('walkup_mean  : ', walkup_mean, 'minutes')
print('oneday_mean  : ', oneday_mean, 'minutes')
print('monthly_mean : ', monthly_mean, 'minutes')
print('annual_mean  : ', annual_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('walkup_mode  : ', walkup_mode, 'minutes')
print('oneday_mode  : ', oneday_mode, 'minutes')
print('monthly_mode : ', monthly_mode, 'minutes')
print('annual_mode  : ', annual_mode, 'minutes')
Overall Dataset excluding 1 min
===============================

-------Duration mean--------
walkup_mean  :  52 minutes
oneday_mean  :  61 minutes
monthly_mean :  15 minutes
annual_mean  :  25 minutes


-------Duration mode--------
walkup_mode  :  10 minutes
oneday_mode  :  8 minutes
monthly_mode :  5 minutes
annual_mode  :  5 minutes
In [282]:
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')

walkup_mean = math.ceil(duration_lim_120.query(' pass_type == "Walk-up" ').duration_min.mean())
walkup_mode = duration_lim_120.query(' pass_type == "Walk-up" ').duration_min.mode()[0]

oneday_mean = math.ceil(duration_lim_120.query(' pass_type == "One Day" ').duration_min.mean())
oneday_mode = duration_lim_120.query(' pass_type == "One Day" ').duration_min.mode()[0]

monthly_mean = math.ceil(duration_lim_120.query(' pass_type == "Monthly" ').duration_min.mean())
monthly_mode = duration_lim_120.query(' pass_type == "Monthly" ').duration_min.mode()[0]

annual_mean = math.ceil(duration_lim_120.query(' pass_type == "Annual" ').duration_min.mean())
annual_mode = duration_lim_120.query(' pass_type == "Annual" ').duration_min.mode()[0]

print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('walkup_mean  : ', walkup_mean, 'minutes')
print('oneday_mean  : ', oneday_mean, 'minutes')
print('monthly_mean : ', monthly_mean, 'minutes')
print('annual_mean  : ', annual_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('walkup_mode  : ', walkup_mode, 'minutes')
print('oneday_mode  : ', oneday_mode, 'minutes')
print('monthly_mode : ', monthly_mode, 'minutes')
print('annual_mode  : ', annual_mode, 'minutes')
Dataset limited under 120 min excluding 1 min
=============================================

-------Duration mean--------
walkup_mean  :  32 minutes
oneday_mean  :  31 minutes
monthly_mean :  12 minutes
annual_mean  :  13 minutes


-------Duration mode--------
walkup_mode  :  10 minutes
oneday_mode  :  8 minutes
monthly_mode :  5 minutes
annual_mode  :  5 minutes
In [283]:
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')

walkup_mean = math.ceil(duration_lim_30.query(' pass_type == "Walk-up" ').duration_min.mean())
walkup_mode = duration_lim_30.query(' pass_type == "Walk-up" ').duration_min.mode()[0]

oneday_mean = math.ceil(duration_lim_30.query(' pass_type == "One Day" ').duration_min.mean())
oneday_mode = duration_lim_30.query(' pass_type == "One Day" ').duration_min.mode()[0]

monthly_mean = math.ceil(duration_lim_30.query(' pass_type == "Monthly" ').duration_min.mean())
monthly_mode = duration_lim_30.query(' pass_type == "Monthly" ').duration_min.mode()[0]

annual_mean = math.ceil(duration_lim_30.query(' pass_type == "Annual" ').duration_min.mean())
annual_mode = duration_lim_30.query(' pass_type == "Annual" ').duration_min.mode()[0]

print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('walkup_mean  : ', walkup_mean, 'minutes')
print('oneday_mean  : ', oneday_mean, 'minutes')
print('monthly_mean : ', monthly_mean, 'minutes')
print('annual_mean  : ', annual_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('walkup_mode  : ', walkup_mode, 'minutes')
print('oneday_mode  : ', oneday_mode, 'minutes')
print('monthly_mode : ', monthly_mode, 'minutes')
print('annual_mode  : ', annual_mode, 'minutes')
Dataset limited under 30 min excluding 1 min
============================================

-------Duration mean--------
walkup_mean  :  17 minutes
oneday_mean  :  16 minutes
monthly_mean :  11 minutes
annual_mean  :  10 minutes


-------Duration mode--------
walkup_mode  :  10 minutes
oneday_mode  :  8 minutes
monthly_mode :  5 minutes
annual_mode  :  5 minutes

Convert the most frequent and average trip durations categorized by pass type into a dataframe:

In [284]:
# convert the most frequent and average trip durations categorized by pass type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30', '< 30', '< 30',
                          '< 120', '< 120', '< 120', '< 120',
                          'overall', 'overall', 'overall', 'overall']

duration_df['pass_type'] = ['Walk-up', 'One Day', 'Monthly', 'Annual', 
                            'Walk-up', 'One Day', 'Monthly', 'Annual',
                            'Walk-up', 'One Day', 'Monthly', 'Annual']

duration_df['duration_avg'] = [17, 16, 11, 10,
                               32, 31, 12, 13,
                               52, 61, 15, 25]

duration_df['duration_mode'] = [10, 8, 5, 5,
                                10, 8, 5, 5,
                                10, 8, 5, 5]
duration_df
Out[284]:
dataset pass_type duration_avg duration_mode
0 < 30 Walk-up 17 10
1 < 30 One Day 16 8
2 < 30 Monthly 11 5
3 < 30 Annual 10 5
4 < 120 Walk-up 32 10
5 < 120 One Day 31 8
6 < 120 Monthly 12 5
7 < 120 Annual 13 5
8 overall Walk-up 52 10
9 overall One Day 61 8
10 overall Monthly 15 5
11 overall Annual 25 5

Plot the most frequent and average trip durations by pass type:

In [285]:
plt.figure(figsize = [12, 5])
flatui = ["#26bda7", "#9b59b6", "#3498db", "#34495e"]
sb.set_palette(flatui, n_colors=4, desat=0.8)

# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)

ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'pass_type', alpha = 0.8)

# improve plot aesthetics
plt.title('Avg. Trip durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# plot empty legend
plt.legend('', frameon=False, fancybox=False)
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

g = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'pass_type', alpha = 0.8)

# improve plot aesthetics
plt.title('Most frequent durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,  
           title='Pass type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Assessment of trip durations based on pass type over datasets\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.5 Assessment of trip durations based on pass type over datasets.png', dpi=300, bbox_inches='tight')

Observations:

The above plot shows that rides on short-term passes (One Day and Walk-up) have higher average trip durations than rides on long-term passes (Monthly and Annual). The difference grows as the trip duration limit is raised.
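
The comparison can be quantified from the averages tabulated above. The sketch below groups the four pass types into short-term and long-term and takes a simple unweighted mean of the already rounded averages per dataset, so the result is only a rough indication. The `TERM` mapping is an assumption made for this illustration.

```python
import pandas as pd

# Average durations copied from the pass type table above
pass_avg = pd.DataFrame({
    'dataset': ['< 30'] * 4 + ['< 120'] * 4 + ['overall'] * 4,
    'pass_type': ['Walk-up', 'One Day', 'Monthly', 'Annual'] * 3,
    'duration_avg': [17, 16, 11, 10, 32, 31, 12, 13, 52, 61, 15, 25],
})

# Hypothetical grouping of pass types by subscription length
TERM = {'Walk-up': 'short-term', 'One Day': 'short-term',
        'Monthly': 'long-term', 'Annual': 'long-term'}
pass_avg['term'] = pass_avg['pass_type'].map(TERM)

# unweighted mean of the rounded averages per dataset and term
by_term = pass_avg.groupby(['dataset', 'term'])['duration_avg'].mean().unstack()
by_term = by_term.loc[['< 30', '< 120', 'overall']]

diff = by_term['short-term'] - by_term['long-term']
print(diff)
```

A weighted comparison would need the trip counts per pass type, which are not shown in this section.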

3.15.6 Calculate the most frequent and average trip durations by fare type:

Calculate the average trip duration and the most frequent trip duration for each fare type:

In [286]:
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')

base_mean = math.ceil(duration_lim.query(' fare_type == "Base" ').duration_min.mean())
base_mode = duration_lim.query(' fare_type == "Base" ').duration_min.mode()[0]

extended_mean = math.ceil(duration_lim.query(' fare_type == "Extended" ').duration_min.mean())
extended_mode = duration_lim.query(' fare_type == "Extended" ').duration_min.mode()[0]

print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('base_mean     : ', base_mean, 'minutes')
print('extended_mean : ', extended_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('base_mode     : ', base_mode, 'minutes')
print('extended_mode : ', extended_mode, 'minutes')
Overall Dataset excluding 1 min
===============================

-------Duration mean--------
base_mean     :  12 minutes
extended_mean :  119 minutes


-------Duration mode--------
base_mode     :  6 minutes
extended_mode :  31 minutes
In [287]:
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')

base_mean = math.ceil(duration_lim_120.query(' fare_type == "Base" ').duration_min.mean())
base_mode = duration_lim_120.query(' fare_type == "Base" ').duration_min.mode()[0]

extended_mean = math.ceil(duration_lim_120.query(' fare_type == "Extended" ').duration_min.mean())
extended_mode = duration_lim_120.query(' fare_type == "Extended" ').duration_min.mode()[0]

print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('base_mean     : ', base_mean, 'minutes')
print('extended_mean : ', extended_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('base_mode     : ', base_mode, 'minutes')
print('extended_mode : ', extended_mode, 'minutes')
Dataset limited under 120 min excluding 1 min
=============================================

-------Duration mean--------
base_mean     :  12 minutes
extended_mean :  57 minutes


-------Duration mode--------
base_mode     :  6 minutes
extended_mode :  31 minutes
In [288]:
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')

base_mean = math.ceil(duration_lim_30.query(' fare_type == "Base" ').duration_min.mean())
base_mode = duration_lim_30.query(' fare_type == "Base" ').duration_min.mode()[0]

# extended fare statistics are not calculated as they do not exist under 30 minutes
print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('base_mean     : ', base_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('base_mode     : ', base_mode, 'minutes')
Dataset limited under 30 min excluding 1 min
============================================

-------Duration mean--------
base_mean     :  12 minutes


-------Duration mode--------
base_mode     :  6 minutes
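
The cell above skips the Extended fare by hand because no Extended trips exist under 30 minutes. Calling `.mode()[0]` on an empty selection would raise an error, and `math.ceil` fails on NaN. The sketch below wraps both statistics in a helper that returns NaN for an empty selection, so the same code path works for every dataset. The helper name `safe_mean_mode` and the toy data are hypothetical.

```python
import math
import numpy as np
import pandas as pd

def safe_mean_mode(durations):
    """Return (ceil(mean), first mode), or (nan, nan) when the selection is empty."""
    if durations.empty:
        return np.nan, np.nan
    return math.ceil(durations.mean()), durations.mode().iloc[0]

# Hypothetical toy data standing in for `bikeshare`
demo = pd.DataFrame({
    'fare_type': ['Base', 'Base', 'Base', 'Extended'],
    'duration_min': [6, 6, 20, 31],
})
under_30 = demo.query('duration_min > 1 and duration_min <= 30')

base_stats = safe_mean_mode(under_30.query('fare_type == "Base"').duration_min)
extended_stats = safe_mean_mode(under_30.query('fare_type == "Extended"').duration_min)
print('Base     :', base_stats)
print('Extended :', extended_stats)
```

The NaN values returned for the Extended fare match the `np.nan` entries entered manually in the dataframe that follows.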

Convert the most frequent and average trip durations categorized by fare type into a dataframe:

In [289]:
# convert the most frequent and average trip durations categorized by fare type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30',
                          '< 120', '< 120',
                          'overall', 'overall']

duration_df['fare_type'] = ['Base', 'Extended', 
                            'Base', 'Extended',
                            'Base', 'Extended']

duration_df['duration_avg'] = [12,  np.nan,
                               12, 57,
                               12, 119]

duration_df['duration_mode'] = [6,  np.nan,
                                6, 31,
                                6, 31]
duration_df
Out[289]:
dataset fare_type duration_avg duration_mode
0 < 30 Base 12.0 6.0
1 < 30 Extended NaN NaN
2 < 120 Base 12.0 6.0
3 < 120 Extended 57.0 31.0
4 overall Base 12.0 6.0
5 overall Extended 119.0 31.0

Plot the most frequent and average trip durations by fare type:

In [290]:
plt.figure(figsize = [12, 5])
flatui = ["#e278fa", "#787efa"]
sb.set_palette(flatui, n_colors=2, desat=0.8)

# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)

ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'fare_type', alpha = 0.8)

# improve plot aesthetics
plt.title('Avg. Trip durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+25, 25)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# plot empty legend
plt.legend('', frameon=False, fancybox=False)
sb.despine(top=True, bottom=False, left=False, right=True);

# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
avg_rental_counts = duration_df["duration_avg"]
avg_rental_types = duration_df["fare_type"]
avg_rental_max = avg_rental_counts.max()
clrs = ['mediumpurple' if (trip == "Extended") else 'violet' for trip in avg_rental_types ]

# get the current tick locations and labels
# locs, labels = plt.xticks()

# loop through each x location, value and color
for loc, avg_rental_count, clr in zip(locs, avg_rental_counts, clrs):
    # skip missing values (no Extended fare trips exist under 30 minutes)
    if pd.notna(avg_rental_count):
        pct_string = '{:0.0f} min'.format(math.ceil(avg_rental_count))
        # place the label slightly above and to the left of the point
        plt.text(loc-0.2, avg_rental_count + int(avg_rental_max/20), pct_string, ha = 'center', color = 'black', fontsize = 12,
                 bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# =====================================================
# /////////////////////////////////////////////////////


# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

ax2 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'fare_type', alpha = 0.8)

# improve plot aesthetics
plt.title('Most frequent durations\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)

# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,  
           title='Fare type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

sb.despine(top=True, bottom=False, left=False, right=True);

# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
freq_rental_counts = duration_df["duration_mode"]
freq_rental_types = duration_df["fare_type"]
freq_rental_max = freq_rental_counts.max()
clrs = ['mediumpurple' if (trip == "Extended") else 'violet' for trip in freq_rental_types ]

# loop through each x location, value and color
for loc, freq_rental_count, clr in zip(locs, freq_rental_counts, clrs):
    # skip missing values (no Extended fare trips exist under 30 minutes)
    if pd.notna(freq_rental_count):
        pct_string = '{:0.0f} min'.format(math.ceil(freq_rental_count))
        # place the label above the point
        plt.text(loc, freq_rental_count + int(freq_rental_max/5), pct_string, ha = 'center', color = 'black', fontsize = 12,
                 bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# =====================================================
# /////////////////////////////////////////////////////

plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Assessment of trip durations based on fare type over datasets\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.6 Assessment of trip durations based on fare type over datasets.png', dpi=300, bbox_inches='tight')

Observations:

  • The plots above show that bike trips subject to Extended Fares contain many outliers. When the outliers are removed by limiting the dataset to 120 minutes, the average duration of trips subject to Extended Fares is 57 minutes.
  • The average duration of trips subject to Base Fares is 12 minutes, while the most frequent trip duration is 6 minutes.
  • The most frequent trip duration under the Extended Fare is 31 minutes. This suggests that many customers who intended to return the bike within the Base Fare window missed the 30-minute limit by about 1 minute.
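
The 31-minute mode can be probed further by measuring how many Extended Fare trips end shortly after the 30-minute limit. The sketch below uses toy data and a hypothetical `grace` window of 5 minutes; neither comes from the dataset.

```python
import pandas as pd

# Toy stand-in for `bikeshare`; the real frame is assumed to have the same columns.
toy = pd.DataFrame({
    "duration_min": [6, 12, 29, 31, 31, 32, 34, 60, 90, 240],
    "fare_type": ["Base"] * 3 + ["Extended"] * 7,
})

base_limit = 30   # Base Fare covers trips up to 30 minutes
grace = 5         # hypothetical window that defines a "near miss"

extended = toy[toy.fare_type == "Extended"]
near_miss = extended[extended.duration_min <= base_limit + grace]

# share of Extended Fare trips that overshoot the limit by at most `grace` minutes
near_miss_share = len(near_miss) / len(extended) * 100
print(f"{near_miss_share:.1f} % of Extended trips end within {grace} min of the limit")
```

A large share in this band would support the reading that riders aim for the Base Fare and narrowly miss it, rather than planning longer trips.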

----------------------------------------------

3.15.7 Insights:

  1. It appears that most customers prefer rides with Normal or Small trip durations and avoid trips with Long and Very Long durations.
  2. The average trip duration of the overall dataset is 30 minutes. However, when the outliers are removed and the statistic is re-evaluated, the average trip duration drops to 18 minutes.
  3. For trips under 30 minutes, the average trip duration is 12 minutes.
  4. The most frequent trip duration remains 6 minutes for all datasets limited by trip duration. This indicates that most customers rent the bikes for a quick, short trip.
  5. Customers tend to travel longer on One Way trips than on Round Trips.
  6. The most frequent trip duration for One Way trips stays at 28 minutes across the various datasets. This suggests that customers who prefer longer trips use nearly the full Base Fare duration on One Way trips.
  7. The most frequent trip duration for Round Trips stays at 5 minutes across the various datasets. This suggests that most customers prefer short trips when it comes to Round Trips.
  8. Customers prefer Smart bikes over other bike types for trips with longer durations.
  9. The gap between Smart bikes and the other bike types widens as the duration limit is raised. For example, Smart bikes have a higher average trip duration in the dataset limited to 120 minutes than in the one limited to 30 minutes.
  10. Standard and Electric bikes have similar average trip durations. This indicates that customers treat these bike types almost alike when it comes to trip duration.
  11. Bike rides on short-term passes such as One Day and Walk-up have higher average trip durations than rides on long-term subscriptions such as Monthly and Annual passes, and the difference increases with the trip duration limit.
  12. Bike trips subject to Extended Fares contain many outliers. When the outliers are removed by limiting the dataset to 120 minutes, the average duration of trips subject to Extended Fares is 57 minutes.
  13. The most frequent trip duration under the Extended Fare is 31 minutes. This suggests that many customers who intended to return the bike within the Base Fare window missed the 30-minute limit by about 1 minute.

3.15.8 Reforms proposed:

  1. Most customers prefer rides with Normal or Small trip durations and avoid trips with Long and Very Long durations. Even though this behaviour is to be expected, encouraging customers to ride for longer durations would generate more income and improve business metrics.
  2. Organizing 2-hour bike rallies and other events would attract enthusiasts to ride for longer durations.
  3. Announcing low fares for tourists would encourage them to rent bikes for longer durations.

3.16 What is the average trip distance and the most frequent trip distance among the bike rentals? Explore the factors that influence the bike ride distance.

  • Column: distance_miles
  • Data type: numerical, continuous
  • Plot : Bar chart, Pie plot

3.16.1 Categorical distribution of trip distances:

In [291]:
# compute the descriptive statistics of trip distances
bikeshare.distance_miles.describe()
Out[291]:
count    808589.000000
mean          0.709956
std           0.692094
min           0.000000
25%           0.310000
50%           0.580000
75%           0.970000
max          24.940000
Name: distance_miles, dtype: float64

As the distance (displacement in this scenario) is derived from the start_station and end_station coordinates, Round Trip entries have a distance_miles of 0, since what was calculated is essentially the displacement. Hence, trips with 0 miles are categorized as Round Trips.
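
To illustrate why a Round Trip yields 0 miles, here is a minimal sketch of a displacement calculation between two stations. It assumes a great-circle (haversine) formula; the notebook does not state which formula was used in the wrangling step, and the function name `haversine_miles` and the coordinates are hypothetical.

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    earth_radius_miles = 3958.8  # mean Earth radius
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_miles * np.arcsin(np.sqrt(a))

# Hypothetical station coordinates, not taken from the dataset
start = (34.0485, -118.2585)
end = (34.0566, -118.2372)

one_way = haversine_miles(*start, *end)
round_trip = haversine_miles(*start, *start)  # same start and end station
print(round(one_way, 2), round_trip)
```

Because the start and end coordinates of a Round Trip are identical, the displacement is exactly zero regardless of how far the bike was actually ridden.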

Break down the trip distances into categories and convert them into a dataframe:

In [292]:
distances = {'trip_type' : pd.Series(['Round Trip', 'Very Small', 'Small', 'Normal', 'Long', 'Very Long']), 
             'trip_count' : pd.Series([bikeshare.query(' distance_miles == 0 ').shape[0], 
                                       bikeshare.query(' distance_miles > 0 and distance_miles < 0.1 ').shape[0],
                                       bikeshare.query(' distance_miles >= 0.1 and distance_miles < 0.5 ').shape[0], 
                                       bikeshare.query(' distance_miles >= 0.5 and distance_miles < 1 ').shape[0], 
                                       bikeshare.query(' distance_miles >= 1 and distance_miles < 10 ').shape[0],
                                       bikeshare.query(' distance_miles >= 10 ').shape[0]])}

# create Dataframe. 
trip_distances = pd.DataFrame(distances)
trip_distances
Out[292]:
trip_type trip_count
0 Round Trip 124322
1 Very Small 3341
2 Small 216254
3 Normal 272339
4 Long 191737
5 Very Long 596
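
The same breakdown can be sketched with a single labelling pass instead of six separate queries. The example below uses `np.select` on toy distances; the column name `trip_category` is hypothetical, and the bin edges are copied from the queries above.

```python
import numpy as np
import pandas as pd

# Toy distances in miles; illustrative values only.
toy = pd.DataFrame({"distance_miles": [0.0, 0.05, 0.1, 0.3, 0.5, 0.97, 1.0, 9.99, 10.0, 24.94]})
d = toy["distance_miles"]

labels = ["Round Trip", "Very Small", "Small", "Normal", "Long", "Very Long"]
conditions = [
    d == 0,
    (d > 0) & (d < 0.1),
    (d >= 0.1) & (d < 0.5),
    (d >= 0.5) & (d < 1),
    (d >= 1) & (d < 10),
    d >= 10,
]

# np.select assigns, for each row, the label of the first condition that holds
toy["trip_category"] = np.select(conditions, labels, default="Unknown")

# count trips per category, keeping the category order defined above
trip_counts = toy["trip_category"].value_counts().reindex(labels, fill_value=0)
print(trip_counts)
```

Keeping the category as a column also makes it available for later grouping, which the count-only dataframe does not allow.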

Plot the categorical distribution of trip distances:

Bar chart:

In [293]:
# Assign grid and color palette as per requirement
plt.figure(figsize = [32, 8])
sb.set_style("white")

# plot pre-calculations
base_color = sb.color_palette()[0]
dist_order = ['Very Long', 'Long', 'Normal', 'Small', 'Very Small', 'Round Trip']
time_order = ['[10, )', '[1, 10)', '[0.5, 1)', '[0.1, 0.5)', '(0, 0.1)', '[0]']
trip_counts = trip_distances.trip_count
trip_order = trip_distances.trip_type
x_tick_values = np.arange(0, trip_counts.max() + 50000, 50000)
x_tick_names = ['{:0.0f} K'.format(v/1000) for v in x_tick_values]
y_tick_values = [0, 1, 2, 3, 4, 5]
y_tick_names = dist_order
clrs = ['indianred', '#585370', '#585370', '#585370', 'indianred', '#674c78']

# bar plot
sb.barplot(x = trip_counts, y = trip_order, order = dist_order, palette=clrs, alpha= 1, saturation = 1)

# plot - visual enhancements
plt.title('Categorical distribution of Trip distances\n', weight = 'bold', fontsize = 30)
plt.xticks(x_tick_values, x_tick_names, fontsize = 22)
plt.yticks(y_tick_values, y_tick_names, fontsize = 22)
plt.xlabel('\nNumber of trips (thousands)', fontsize = 26)
plt.ylabel('Distance type\n', fontsize = 26)

# Create a legend:
# -------------------------------------------------------
indents = [10, 13, 12, 14, 11, 11]
# Plot empty lists with the desired label
for dist, time, indent in zip(dist_order, time_order, indents):
    plt.scatter([], [], c='k', alpha=0.3,
                label= '{}'.format(dist).ljust(indent, ' ') + ' - ' + '{}'.format(time))
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=True, ncol = 1, framealpha = 1, 
           borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.5), loc = 6, labelspacing=0.5,  
           title='Distance - miles', title_fontsize=24, fontsize=22, facecolor='white', 
           markerfirst=True, handlelength=0.5, handletextpad=0.5)
# -------------------------------------------------------

sb.despine();

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.1 Categorical distribution of Trip distances.png', dpi=300, bbox_inches='tight')

Observation:

  • The plot above depicts a fairly even distribution of bike rentals across Small, Normal and Long trip distances. This suggests a productive business model covering all kinds of customer needs.
  • Only a small number of customers take Very Small or Very Long bike rides. This is also a good sign, as Very Small rides may reflect customer dissatisfaction, and a larger number of Very Long rides would disperse bikes away from a geographical location and reduce availability there.
  • A good number of customers preferred Round Trips.

Reform:

  • Organizing long bike rallies and other events will attract enthusiasts to ride the bike for longer distances.

Deeper exploration of the factors that influence the trip distances for hidden insights:

Limit the dataset to 3 miles to minimize the influence of outliers:

In [294]:
# calculate the percentage of the dataset that falls under `3 miles` trip distance.
data_percent = np.round((bikeshare.query(' distance_miles <= 3 ').shape[0]/bikeshare.shape[0])*100, 2)
print('percent of trips that fall under 3 miles trip distances is : {} %'.format(data_percent))
percent of trips that fall under 3 miles trip distances is : 99.5 %
  1. The statistical analysis of trip distances is affected by the presence of outliers. Entries limited to 3 miles constitute 99.5% of the dataset.
  2. The Round Trip entries have a distance/displacement equal to zero and are clustered together, unlike One Way trips, which are spread over distances up to about 25 miles. Hence, entries with a distance of 0 are removed in the further analysis (except in the trip type comparison) to obtain the correct statistics.
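
The effect of the cutoff can be sketched on a small example. The values below are toy data, chosen only to show how a single extreme distance pulls the mean while leaving the median almost unchanged.

```python
import pandas as pd

# Toy distances in miles with one extreme value; illustrative only.
d = pd.Series([0.3, 0.5, 0.6, 0.8, 1.0, 1.2, 2.5, 24.9], name="distance_miles")

cutoff = 3  # miles, the limit used in this section
share_within = (d <= cutoff).mean() * 100   # mean of a boolean mask = fraction of True
trimmed = d[d <= cutoff]

print(f"share within {cutoff} miles: {share_within:.1f} %")
print(f"mean   all: {d.mean():.2f}   trimmed: {trimmed.mean():.2f}")
print(f"median all: {d.median():.2f}   trimmed: {trimmed.median():.2f}")
```

Taking the mean of the boolean mask is a compact alternative to dividing two `shape[0]` values, and comparing mean and median before and after trimming gives a quick check of how much the outliers matter.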

3.16.2 Calculate the most frequent and average trip distances by trip type:

Limit the dataset to the entries under 3 miles trip distances. As round trips are involved, entries with 0 distances are included.

In [295]:
# Limit the dataset to the entries under 3 miles distance
distance_lim_3 = bikeshare.query(' distance_miles <= 3 ')

Calculate the average trip distance and the most frequent trip distance subjected to each trip type.

In [296]:
plt.figure(figsize = [12, 5])
sb.set_palette(palette = "GnBu_d", n_colors = 2, desat = None)
base_color = sb.color_palette()[0]

# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)

ax1 = sb.barplot(data = distance_lim_3, x = "trip_type", y = "distance_miles", hue = 'trip_type')

# improve plot aesthetics
plt.title('Avg. Trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nTrip type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)

# plot empty legend
plt.legend('', frameon=False, fancybox=False)

# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

# prepare the data for the plot
oneway_mode = distance_lim_3.query(' trip_type == "One Way" ').distance_miles.mode()[0]
roundtrip_mode = distance_lim_3.query(' trip_type == "Round Trip" ').distance_miles.mode()[0]
heights = [oneway_mode, roundtrip_mode]
labels = distance_lim_3.trip_type.sort_values(ascending=True).unique()

g = sb.barplot(x = labels, y = heights, hue = labels)

# improve plot aesthetics
plt.title('Most frequent trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nTrip type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.ylim(ax1.get_ylim())
plt.yticks(fontsize=12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,  
           title='Trip type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on trip type\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.2 Assessment of trip distances under 3 miles based on trip type.png', dpi=300, bbox_inches='tight')

Observation:

  • One Way trips have an average trip distance of 0.8 miles, while the most frequent trip distance is 0.5 miles.
  • As Round Trips have a total displacement of 0, no meaningful distance statistics can be computed for them.
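
Since distance_miles is a continuous variable, its most frequent value depends on the precision at which it is compared. The sketch below, on toy data, contrasts the mode of the stored values with the mode after rounding to 0.1 mile, which is the precision shown in the bar annotations.

```python
import pandas as pd

# Toy one-way distances in miles (two-decimal precision); illustrative only.
d = pd.Series([0.31, 0.31, 0.48, 0.52, 0.53, 0.54, 0.97, 1.20])

raw_mode = d.mode()[0]               # most frequent exact value
binned_mode = d.round(1).mode()[0]   # most frequent value after rounding to 0.1 mile

print(raw_mode, binned_mode)
```

When the two values differ, the rounded mode is usually the more robust summary, because it reflects where trips cluster rather than which exact reading happens to repeat.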

3.16.3 Calculate the most frequent and average trip distances by bike type:

Note: Since conclusions drawn about the unknown bike type are not actionable, the unknown bike type is excluded from the analysis.

Limit the dataset to the entries under 3 miles and remove entries with 0 trip distances:

In [297]:
# Limit the dataset to the entries under 3 miles distance and remove entries with '0' distances
distance_lim_3 = bikeshare.query(' distance_miles <= 3  and distance_miles > 0 and bike_type != "unknown" ').copy()

# categorize the bike type variable
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
distance_lim_3['bike_type'] = distance_lim_3['bike_type'].astype(ordered_cat)

Calculate the average trip distance and the most frequent trip distance subjected to each bike type.

In [298]:
# Assign color palette and figure size as per requirement 
plt.figure(figsize = [12, 5])
sb.set_style('white')
flatui = ['#60acfc', '#91ffda', '#bd91ff']
sb.set_palette(flatui, n_colors=3, desat=0.8)


# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)

ax1 = sb.barplot(data = distance_lim_3, x = "bike_type", y = "distance_miles", hue = 'bike_type', dodge=False)

# improve plot aesthetics
plt.title('Avg. Trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nBike type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)

# plot empty legend
plt.legend('', frameon=False, fancybox=False)

# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

# prepare the data for the plot
standard_mode = distance_lim_3.query(' bike_type == "Standard" ').distance_miles.mode()[0]
electric_mode = distance_lim_3.query(' bike_type == "Electric" ').distance_miles.mode()[0]
smart_mode = distance_lim_3.query(' bike_type == "Smart" ').distance_miles.mode()[0]
heights = [standard_mode, electric_mode, smart_mode]
labels = distance_lim_3.bike_type.sort_values(ascending=True).unique()

g = sb.barplot(x = labels, y = heights, hue = labels, dodge=False)

# improve plot aesthetics
plt.title('Most frequent trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.ylim(ax1.get_ylim())
plt.yticks(fontsize=12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Bike type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on bike type\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.3 Assessment of trip distances under 3 miles based on bike type.png', dpi=300, bbox_inches='tight')

Observation:

  • The above plot depicts that customers preferred Smart bikes for longer trip distances compared to other bike types.
  • The Standard and Electric bike types have the same average and most frequent trip distances. This suggests that customers have a similar preference for Standard and Electric bikes.

3.16.4 Calculate the most frequent and average trip distances by pass type:

Note: Since the Flex pass was introduced to employees for testing purposes, the Flex pass type is excluded from the analysis.

Limit the dataset to the entries under 3 miles and remove entries with 0 trip distances:

In [299]:
# Limit the dataset to the entries under 3 miles distance and remove entries with '0' distances
distance_lim_3 = bikeshare.query(' distance_miles <= 3  and distance_miles > 0 and pass_type != "Flex" ').copy()

# categorize the pass type variable
level_order = ['Walk-up', 'One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
distance_lim_3['pass_type'] = distance_lim_3['pass_type'].astype(ordered_cat)

Calculate the average trip distance and the most frequent trip distance subjected to each pass type.

In [300]:
# Assign color palette and figure size as per requirement 
plt.figure(figsize = [12, 5])
sb.set_style('white')
flatui = ["#26bda7", "#9b59b6", "#3498db", "#34495e"]
sb.set_palette(flatui, n_colors=4, desat=0.8)


# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)

ax1 = sb.barplot(data = distance_lim_3, x = "pass_type", y = "distance_miles", hue = 'pass_type', dodge=False)

# improve plot aesthetics
plt.title('Avg. Trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nPass type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)

# plot empty legend
plt.legend('', frameon=False, fancybox=False)

# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

# prepare the data for the plot
walkup_mode = distance_lim_3.query(' pass_type == "Walk-up" ').distance_miles.mode()[0]
oneday_mode = distance_lim_3.query(' pass_type == "One Day" ').distance_miles.mode()[0]
monthly_mode = distance_lim_3.query(' pass_type == "Monthly" ').distance_miles.mode()[0]
annual_mode = distance_lim_3.query(' pass_type == "Annual" ').distance_miles.mode()[0]
heights = [walkup_mode, oneday_mode, monthly_mode, annual_mode]
labels = distance_lim_3.pass_type.sort_values(ascending=True).unique()

ax2 = sb.barplot(x = labels, y = heights, hue = labels, dodge=False)

# improve plot aesthetics
plt.title('Most frequent trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nPass type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,  
           title='Pass type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in ax2.patches:
    ax2.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////

# adjust the two plots to have the same ylimits/yticks
# (compare the upper limits explicitly rather than the (bottom, top) tuples)
if ax1.get_ylim()[1] < ax2.get_ylim()[1]:
    ax1.set_ylim(ax2.get_ylim())
else:
    ax2.set_ylim(ax1.get_ylim())

plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on pass type\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.4 Assessment of trip distances under 3 miles based on pass type.png', dpi=300, bbox_inches='tight')

Observation:

  • The plot above shows that rides on the Monthly pass have a lower average trip distance than rides on other pass types.
  • The short-term passes, One Day and Walk-up, have a higher most frequent trip distance than the long-term subscriptions. A plausible explanation is that long-term passes are chosen by working individuals for the ride to work, while short-term passes are preferred by tourists and explorers who take longer rides around the city.
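
The plotting cell above aligns the two panels by comparing their y-limits after drawing. An alternative sketch, using plain matplotlib rather than seaborn, creates both axes with `sharey=True` so the limits stay linked automatically. The bar heights are hypothetical placeholders, not values from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for a standalone run; omit inside the notebook
import matplotlib.pyplot as plt

# Hypothetical bar heights for the two panels (not values from the dataset)
pass_types = ["Walk-up", "One Day", "Monthly", "Annual"]
avg_miles = [0.9, 0.9, 0.8, 0.8]
mode_miles = [0.6, 0.6, 0.5, 0.5]

# sharey=True links the y-axes, so both panels always show the same limits
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
ax_left.bar(pass_types, avg_miles)
ax_left.set_title("Avg. Trip distance")
ax_left.set_ylabel("Avg. distance (miles)")
ax_right.bar(pass_types, mode_miles)
ax_right.set_title("Most frequent trip distance")

same_limits = ax_left.get_ylim() == ax_right.get_ylim()
print(same_limits)
```

With shared axes the manual limit comparison is no longer needed, although the right panel then hides its own tick labels by default.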

3.16.5 Calculate the most frequent and average trip distances by fare type:

Limit the dataset to the entries under 3 miles and remove entries with 0 trip distances:

In [301]:
# Limit the dataset to the entries under 3 miles distance and remove entries with '0' distances
distance_lim_3 = bikeshare.query(' distance_miles <= 3  and distance_miles > 0 ').copy()

Calculate the average trip distance and the most frequent trip distance subjected to each fare type.

In [302]:
# Assign color palette and figure size as per requirement 
plt.figure(figsize = [12, 5])
flatui = ["#fff480"]
sb.set_palette(flatui, n_colors=1, desat=0.8)
base_color = sb.color_palette()[0]

# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.barplot(data = distance_lim_3, x = "fare_type", y = "distance_miles", 
                 hue = 'fare_type', alpha = 0.8, dodge=False)
# improve plot aesthetics
plt.title('Avg. Trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nFare type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)

# plot empty legend 
plt.legend('', frameon=False, fancybox=False)

# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)

# prepare the data for the plot
base_mode = distance_lim_3.query(' fare_type == "Base" ').distance_miles.mode()[0]
extended_mode = distance_lim_3.query(' fare_type == "Extended" ').distance_miles.mode()[0]
heights = [base_mode, extended_mode]
labels = distance_lim_3.fare_type.sort_values(ascending=True).unique()

ax2 = sb.barplot(x = labels, y = heights, hue = labels, alpha = 0.8, dodge=False)

# improve plot aesthetics
plt.title('Most frequent trip distance\n\n',  weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nFare type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)

# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1, 
           borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,  
           title='Fare type', title_fontsize=12, fontsize=10, facecolor='white', 
           markerfirst=True, handlelength=2, handletextpad=0.5)

# add annotations
# -------------------------------------------------------
for p in ax2.patches:
    ax2.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------

sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////


# adjust the two plots to have the same y axis limits
# (compare the upper limits explicitly rather than the (bottom, top) tuples)
if ax1.get_ylim()[1] < ax2.get_ylim()[1]:
    ax1.set_ylim(ax2.get_ylim())
else:
    ax2.set_ylim(ax1.get_ylim())

plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on fare type\n', fontsize = 16, weight = 'bold');

# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.5 Assessment of trip distances under 3 miles based on fare type.png', dpi=300, bbox_inches='tight')

Observation:

  • The above plot shows that trips with Extended Fares have a higher most frequent trip distance than trips with Base Fares. This is to be expected, as trip distances and trip durations are correlated.
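The per-group mode used for the bar heights can also be computed in a single grouped step. The following is a minimal sketch, assuming a small made-up sample in place of `distance_lim_3`; only the column names `fare_type` and `distance_miles` come from the notebook, the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for `distance_lim_3`; values are made up.
sample = pd.DataFrame({
    "fare_type": ["Base", "Base", "Base", "Extended", "Extended", "Extended"],
    "distance_miles": [0.5, 0.5, 0.9, 1.2, 1.2, 0.7],
})

# Series.mode() returns the modes in sorted order and may return several on a tie;
# taking the first one mirrors the `.mode()[0]` calls in the cell above.
mode_by_fare = (
    sample.groupby("fare_type")["distance_miles"]
          .agg(lambda s: s.mode().iloc[0])
)
print(mode_by_fare)
```

Because `groupby` sorts its keys, the result is already in the same alphabetical order as the `labels` array, so heights and labels cannot drift apart.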

----------------------------------------------

3.16.6 Insights:

  1. Bike rentals are fairly well distributed across Small, Normal and Large trip distances. This represents a productive business model that covers all kinds of customer needs.
  2. Only a small number of customers take Very Small and Very Long bike rides. This is also a good sign, as Very Small rides represent customer dissatisfaction, and a larger number of Very Long rides would scatter the available bikes across a geographical location.
  3. A good number of customers preferred Round Trips.
  4. One Way trips have an average trip distance of 0.8 miles, while the most frequent trip distance is 0.5 miles.
  5. Customers preferred Smart bikes for longer trip distances compared to other bike types.
  6. Bike rides taken on a Monthly pass have lower average bike rentals compared to other pass types.
  7. Short-term passes such as One Day and Walk-up have a higher most frequent trip distance than long-term subscriptions. This is because long-term passes are chosen by working individuals for the ride to work, while short-term passes are preferred by tourists and explorers who take longer rides to explore the city.
  8. Trips with Extended Fares have a higher most frequent trip distance than trips with Base Fares. This is to be expected, as trip distances and trip durations are correlated.

3.16.7 Reforms proposed:

  1. Organizing long bike rallies and other events will attract enthusiasts to ride the bike for longer distances.

-------- End of 3. Explanatory Data Analysis --------


4. Investigation Summary:

=====================================

4.1 Do customers prefer One Way trips compared to Round Trips?

Insights:
Insight 1 The distribution of bike rentals aggregated over all years suggests that customers prefer One Way trips to Round Trips, with a grey area in the Early Hours of the day, where the average number of bike rentals is very low and not statistically significant for comparison.
Insight 2 The average number of One Way bike rentals decreases on Saturdays and Sundays, while Round Trips experience a slight increase.
Insight 3 The first half of 2019 saw relatively few One Way bike rentals compared to the second half of the year. This trend is not limited to 2019 but is consistent across the other years.
Insight 4 Standard bike rentals for One Way trips fell significantly in 2019 compared to 2018. At the same time there is a noticeable increase in Electric bike rentals for One Way trips. This shows that customers preferred Electric bikes over Standard bikes for One Way trips in 2019. However, as the Electric bike was introduced at the end of 2018, the customers who used Standard bikes shifted suddenly towards Electric bikes after the second quarter of 2019. This indicates that customers' bike preference has changed, but the total number of One Way bike rentals has not decreased.
Insight 5 Customers who pay the Base fare prefer One Way trips, while customers who pay Extended fares take almost as many Round Trips as One Way trips and do not show a preference for either trip type.
Reforms proposed:
Reform 1 Care should be taken to increase the number of bike rentals at the end of the week. Organizing events such as bike rallies will significantly increase bike rentals during holidays and weekends.
Reform 2 Announcing discounts on One Way trips from stations with a high bike count to stations with a low bike count during the weekdays will normalize the distribution of bikes across all stations.
Reform 3 Promotions/discounts should be offered on One Way trips over the first half of the year to encourage customers to take more One Way trips.
Reform 4 Having fewer customers who pay Extended fares on One Way trips has a positive end result and does not require any intervention. If more customers took longer One Way rides (as fares and ride durations are correlated), the bikes would end up farther from their home stations and might create a gap between the supply and demand of bikes at these stations. However, as more customers prefer One Way trips for shorter rides (Base fare), the bikes end up in the same geographical cluster, which makes it easier to redirect customers to nearby available stations in case of a bike shortage.

4.2 Are Standard bikes in more demand compared to Smart and Electric bikes? Is the launch of Smart and Electric bikes in 2019 considered a success?

Insights:
Insight 1 The classification of bikes was introduced at the end of 2018. Hence the rentals with an unknown bike category are ignored and the analysis is limited to 2019.
Insight 2 Rentals of Standard bikes decrease over 2019, while rentals of Smart and Electric bikes increase over the same period. Hence, even though Standard bikes were popular at the start of 2019, customers preferred Smart and Electric bikes towards the end of 2019. It can therefore be concluded that the launch of Electric and Smart bikes is a success.
Insight 3 Even though Standard bikes are the most popular choice during the first quarter of the year, Electric bikes gradually gained popularity for One Way trips over the rest of the year.
Insight 4 Customers who take Round Trips do not have any preference for bike types.
Insight 5 Even though Standard bikes are the most popular choice for customers with a One Day pass during the first quarter of the year, the number of One Day pass rentals decreased to the point that there is no significant difference in preference between Standard bikes and Smart bikes towards the end of 2019.
Insight 6 Customers with a Monthly pass preferred Standard bikes during the first quarter of the year; however, Electric bikes gained more popularity over the rest of 2019.
Reforms proposed:
Reform 1 Even though Smart bikes were introduced along with Electric bikes, they failed to gain as much popularity as Electric bikes. Hence discounts should be announced to increase the rental activity of Smart bikes during peak hours, which in turn helps stations maintain the availability of other bike types.
Reform 2 Use Smart bikes in promotional events like bike rallies to familiarize customers with their features and encourage customers to prefer Smart bikes in the future.

4.3 Is the Monthly pass the most subscribed pass type among customers?

Insights:
Insight 1 The Monthly pass has always been the most popular choice among customers, and the discontinuation of the Walk-up pass in 2019 further increased the number of bike rentals on the Monthly pass.
Insight 2 There is a slight increase in rentals on the Annual pass in 2019.
Insight 3 The majority of One Way bike rentals are taken by customers with a Monthly subscription.
Insight 4 The number of One Way bike rentals taken on a One Day subscription decreased steadily over 2018 and 2019, which might be the reason for the increase in Monthly subscribers in the second half of 2019.
Insight 5 The majority of Round Trip bike rentals are taken by customers with a One Day subscription.
Reforms proposed:
Reform 1 Discounts should be announced on the One Day pass to encourage tourists and non-subscribers to rent a bike.

4.4 Do the majority of customers use the Base fare option to reach their destinations? If yes, what percentage of bike rentals generate extra income in the form of Extended fares?

Insights:
Insight 1 The majority of customers use the Base fare option to reach their destinations. However, 2019 saw relatively fewer bike rentals in the first and second quarters than in the third and fourth quarters. Reforms are needed to increase bike rentals in the first half of the year.
Insight 2 Around 25.5% of bike rentals generated extra income in the form of Extended fares, which portrays a good business model. However, the average number of Extended fare bike rentals in 2019 is lower than in 2018 and needs to be increased by adopting new rental techniques that encourage customers to ride for longer.
Reforms proposed:
Reform 1 Discounts/promotions should be announced to encourage customers to ride bikes for longer durations.
Insights:
Insight 1 Based on the classification of aggregated bike rentals over various parameters, it can be concluded that most customers prefer Standard bikes over Smart bikes, take more One Way trips than Round Trips, and prefer the Monthly pass over other subscriptions.

4.6 Is the majority of the customer database comprised of working individuals?

Insights:
Insight 1 Bike rentals aggregated by hour of the day show that rentals rise slowly from 6:00 AM until 5:00 PM, with peaks at 8:00 AM, 12:00 PM, and 5:00 PM, which correspond to morning office hours, afternoon lunch time, and the evening end of office hours respectively. This suggests that a large portion of the customer database consists of working individuals who use bikes for transportation.
Insight 2 The average bike rentals by hour of the day show that rentals are lowest during the night and early hours.
Insight 3 Bike rentals decrease on non-working days such as Saturday and Sunday. This reinforces the argument that the majority of the customer base consists of working individuals.
Insight 4 The average number of bike rentals taken by non-subscribers and tourists (Walk-up pass or One Day pass) is lower than the number of bike rentals taken by working individuals (with a Monthly pass). This reinforces the argument that the majority of the customer database is comprised of working individuals.
Reforms proposed:
Reform 1 2019 saw a steep decrease in bike rentals on non-working days compared to previous years. This reflects a failure to attract tourists and non-subscribers to ride over weekends. Promotions should be announced for tourists and non-subscribers to encourage them to rent a bike.
Reform 2 Encouraging working individuals to ride on non-working days will increase revenue.

4.7 How can we increase the bike rentals based on hour of the day?

Insights:
Insight 1 Rental activity is highest in the Afternoon, with Morning and Evening close behind. This indicates that customers rent bikes the most during the daytime. Consequently, rental activity is lowest during Early Hours and at Night.
Reforms proposed:
Reform 1 Promoting morning fitness activities such as morning bike challenges could increase bike rental activity during the Early Hours of the day.
Reform 2 Similarly, tie-ups with night events would boost Night-time bike rentals.

4.8 Do bike rentals decrease during the end of the month?

Insights:
Insight 1 Bike rentals aggregated by day of the month show that rentals decrease slightly at the end of the month. However, a deeper analysis using average bike rentals makes it clear that rental activity actually increases at the end of the month.
Insight 2 Also, the average bike rentals by day of the month range only between 700 and 800. This shows that there is no significant difference in average bike rentals between any two days of the month.
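One plausible reason the aggregated and averaged views disagree (an assumption, the notebook does not state it) is that days 29 to 31 simply occur in fewer months, so their totals shrink even when daily demand is flat. The sketch below uses hypothetical constant daily counts for a single year to show the effect.

```python
import pandas as pd

# Hypothetical daily rental counts for one non-leap year; demand is constant by construction.
days = pd.date_range("2019-01-01", "2019-12-31", freq="D")
daily = pd.DataFrame({"date": days, "rentals": 750})

by_day_of_month = daily.groupby(daily["date"].dt.day)["rentals"]
total_per_day = by_day_of_month.sum()    # smaller for days 29-31: fewer months contain them
mean_per_day = by_day_of_month.mean()    # flat, because it divides by the number of occurrences

print(total_per_day.loc[[1, 29, 30, 31]])
print(mean_per_day.loc[[1, 29, 30, 31]])
```

With real data the mean per day of the month is therefore the fairer basis for comparing the start and end of the month.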

4.9 Does the weekday have any effect on bike rentals? If the effect is negative, propose ideas to overcome the crisis.

Insights:
Insight 1 The years 2017 and 2018 show a relatively slight decrease in average bike rentals on weekends compared to other days of the week; however, 2019 experienced a sudden drop in average bike rentals on weekends (Saturday and Sunday). This is not a good sign for a healthy business model and calls for reforms.
Insight 2 Customers with long-term subscriptions such as the Annual pass and Monthly pass prefer Standard bikes and Electric bikes for travel on working days and are less likely to travel on weekends. As the customer database contains a majority of working individuals, they tend to prefer One Way trips, which decrease during weekends.
Insight 3 The analysis shows that even though One Way trips (probably taken by working individuals who ride to work) decrease over weekends, Round Trips increase during weekends.
Insight 4 The pass_type has a stronger influence than bike_type on bike rentals over the week.
Insight 5 Smart bikes experience a slight increase in average bike rentals over weekends.
Insight 6 New/temporary customers with no existing pass (e.g. tourists/travellers/activists) tend to take a short-term pass such as the One Day pass and prefer Standard bikes and Smart bikes. Hence Smart bikes experience their highest bike rentals during weekends. This category of customers also tends to take Round Trips and ride for longer durations, resulting in Extended fares and thus generating more income for the company.
Reforms proposed:
Reform 1 Organizing or promoting fitness/recreational activities such as bike rallies could significantly increase bike rentals on weekends and holidays.
Reform 2 The number of One Day pass customers who prefer Standard bikes fell significantly in 2019. Hence attracting this category of customers to use Standard bikes will enhance the business model significantly.
Reform 3 As a major part of the customer database is comprised of working individuals, take advantage of low weekend rentals to rebalance bike availability across stations in preparation for Monday's rental traffic.

4.10 Are there any bike stations that have low bike rental/return activity over the geographical distribution and are not scalable for maintenance?

Insights:
Insight 1 The stations with IDs (4143, 4327, 4362, 4363, 4373, 4490, 4321, 4467, 4468) have very low bike activity (rentals and returns combined) and are deemed high maintenance. These stations do not even reach 10 bike activities over the span of 3 years.
Reforms proposed:
Reform 1 Hence these stations are not financially viable for further business and need to be either closed or relocated to locations with potential bike traffic.
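Combined station activity of this kind can be derived by counting trip starts and trip ends per station and adding the two. The following is a rough sketch on made-up trips; the column names `start_station` and `end_station` and the threshold are assumptions for illustration, not the notebook's actual query.

```python
import pandas as pd

# Hypothetical trips table; column names are assumed, values are made up.
trips = pd.DataFrame({
    "start_station": [3005, 3005, 3005, 4143, 3030, 3030],
    "end_station":   [3030, 3030, 4143, 3005, 3005, 3005],
})

# Activity = rentals (trip starts) + returns (trip ends) per station.
rentals = trips["start_station"].value_counts()
returns = trips["end_station"].value_counts()
activity = rentals.add(returns, fill_value=0).astype(int)

THRESHOLD = 3  # illustrative only; the finding above refers to fewer than 10 activities
low_activity = activity[activity < THRESHOLD].index.tolist()
print(low_activity)
```

Using `fill_value=0` keeps stations that appear only as a start or only as an end, which matters for exactly the low-traffic stations this check is meant to find.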

4.11 Is there a gap between the demand and supply of the bikes at any given time in a day? If yes, propose a model for reducing the gap.

Insights:
Insight 1 A 6-hour window (8:00 AM - 2:00 PM) experiences a shortage in the supply of bikes relative to customer demand. However, the gap between supply and demand is very small and does not require immediate attention.
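An hourly gap of this kind can be approximated by treating trip starts as demand and trip ends as supply. The sketch below is only an illustration on made-up trips; the column names `start_time` and `end_time` are assumptions, and the notebook may define supply and demand differently.

```python
import pandas as pd

# Hypothetical trips; column names are assumed, timestamps are made up.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-05-01 08:10", "2019-05-01 08:40", "2019-05-01 09:05", "2019-05-01 17:20",
    ]),
    "end_time": pd.to_datetime([
        "2019-05-01 08:30", "2019-05-01 09:10", "2019-05-01 09:25", "2019-05-01 17:45",
    ]),
})

hours = range(24)
demand = trips["start_time"].dt.hour.value_counts().reindex(hours, fill_value=0)  # bikes taken out
supply = trips["end_time"].dt.hour.value_counts().reindex(hours, fill_value=0)    # bikes returned
gap = demand - supply  # positive = more rentals than returns in that hour

print(gap[gap != 0])
```

Over a full day the gap sums to zero when every trip is returned, so it is the hours with a sustained positive gap that indicate a shortage window.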

4.12 Is bike rental traffic equally distributed over the start stations? If not, how can the start stations be better optimized to increase their rental activity?

Insights:
Insight 1 The bike rental traffic is not equally distributed over the start stations. However, this does not imply that the low-traffic start stations should be eliminated, as they might receive good bike return traffic and still achieve acceptable business metrics.
Insight 2 A large number of stations experience Low and Very Low rental traffic for Round Trips. This reveals the need to improve Round Trip rental traffic at the start stations.
Insight 3 The number of start stations that experience High and Very High bike rental traffic for Round Trips is lower than for One Way trips. This indicates that One Way trips are more popular among customers.
Insight 4 Across bike types, only a very small number of start stations experience Very Low bike rental activity, which is a good sign for a healthy business model. However, the number of start stations that experience Very High rental activity is also very small. This keeps the start stations from serving their full potential.
Insight 5 As the start stations that experience Low and Very Low bike rental activity by bike type are clustered closely, this indicates that bike type has no influence on the rental activity at these stations and that other factors might have caused the decrease in rental activity at these specific stations. So no bike-type-specific action is required to increase rental activity at start stations with Low and Very Low rental activity.
Insight 6 Smart bikes have fewer start stations with Normal and High rental traffic than other bike types. This suggests that Smart bikes require more advertisement and awareness among customers.
Insight 7 A large number of start stations experience either Low or Very Low rental traffic for the Annual pass. As the Annual pass is a long-term subscription, this behaviour is to be expected.
Insight 8 All start stations fall into Low and Very Low bike rental traffic for the Flex pass. This is because the Flex pass was originally issued to employees for testing purposes. Hence this insight is ignored.
Insight 9 A fair number of start stations experience Low rental traffic for the Monthly pass. As the Monthly pass is mostly preferred by working individuals, no action is likely to encourage them to switch their preferred start station, since that might delay their ride to work. Hence no action is required to increase Monthly pass rental activity at start stations with Low rental activity.
Insight 10 Many start stations have relatively Low bike rental activity for the One Day pass. This might be due to their geographical location or to rentals being taken up by other customer pass types.
Insight 11 A large number of start stations experience Low rental traffic for Extended fares. Action should be taken to encourage customers to take longer trips at these stations to increase income generation.
Insight 12 A large number of start stations experience Normal and Higher rental traffic for Base fares. This is a good sign for a healthy business model with a higher average bike rental traffic over time.
Reforms proposed:
Reform 1 Discounts or promotional activities should be announced for Round Trips to encourage customers to rent bikes from the stations with Low and Very Low Round Trip rental activity.
Reform 2 Announce promotions on Smart bikes to increase their rental activity at start stations.
Reform 3 Promotions should be announced to increase One Day pass rental traffic at start stations with Low bike rental activity.
Reform 4 Promotions should be announced to encourage customers to take longer trips at stations with Low Extended Fare rental traffic, to increase income generation.

4.13 Is bike return traffic equally distributed over the end stations? If not, how can the end stations be better optimized to increase their bike return activity?

Insights:
Insight 1 The bike return traffic is not equally distributed over the end stations. There are end stations with Very Low bike return activity. However, this does not imply that these end stations should be eliminated, as they might receive good bike rental traffic and still achieve acceptable business metrics.
Insight 2 Many end stations experience Low bike return traffic for One Way trips. For Round Trips, the bike rental and the bike return happen at the same station, so Low bike return traffic at a station also implies Low bike rental traffic at that station. For One Way trips, however, Low bike returns create a deficit in bike availability at the specific station. Hence immediate action is required to redirect bikes from stations with High and Very High One Way return traffic to stations with Low and Very Low One Way return traffic, to normalize bike availability across all stations.
Insight 3 A large number of end stations experience Low bike return traffic for Round Trips.
Insight 4 A number of end stations that experience Low and Very Low bike return activity are clustered closely together. This indicates that bike type has no influence on the return activity at these stations, because bike returns are correlated with bike rentals. So no bike-type-specific action is required to increase bike return activity at the end stations with Low and Very Low bike return activity.
Insight 5 For the Annual pass, a high number of end stations have Low and Very Low return traffic. This is because the Annual pass also has a high number of stations with Low and Very Low rental activity.
Insight 6 Many end stations have relatively Low bike rental activity for the One Day pass.
Insight 7 Bike returns on Extended Fares show a high number of end stations with Low return traffic and fewer stations with High return traffic. This indicates that Extended Fares are less desired by customers.
Insight 8 Bike returns on Base Fares show a high number of end stations with High return traffic and fewer stations with Low return traffic. This indicates that Base Fares are preferred by customers.
Reforms proposed:
Reform 1 Promotions should be announced to encourage customers to opt for Round Trips at the end stations with Low bike return traffic.
Reform 2 Smart bikes have fewer end stations with Normal and High bike return traffic than other bike types. This is because Smart bikes have fewer start stations with Normal and High bike rental traffic than other bike types. However, some stations might receive fewer Smart bike returns than rentals, which will result in a Smart bike shortage at these particular stations. Hence Smart bike returns should be adjusted/redirected to match the bike demand at the specific end stations.
Reform 3 As many end stations have relatively Low bike rental activity for the One Day pass, reforms should be made to redirect bike returns from other stations to match the demand for bikes at these end stations.
Reform 4 Action should be taken to encourage customers to ride for longer durations and incur Extended Fares, thus generating more income for the company.

4.14 Is there a requirement for launching a reminder to notify customers of the expiration of the Base Fare?

Insights:
Insight 1 It appears that customers most frequently return bikes just after the Base Fare margin. Even if this is the result of individual preference or behaviour, it might give rise to customer dissatisfaction at paying an extra fee (Extended Fare) that could have been avoided with a little caution. This indicates that there is a need for a reminder that notifies customers of the expiration of the Base Fare.
Insight 2 It appears that 16% of bike rides are eligible for a 5-minute grace period before the Extended Fare is charged.
Reforms proposed:
Reform 1 Launching a reminder via mobile notification or other channels to notify customers of the expiration of the Base Fare will alert them to return the bike to the nearest station and avoid the Extended Fare, which will increase customer satisfaction.
Reform 2 The alternative is to give Extended Fares a 5-minute grace period. Since the percentage of rides eligible for the 5-minute grace period is far smaller than the percentage of rides with Extended Fares, the income generated from Extended Fares will decrease only slightly. Hence deploying a grace period is a fair option with little impact on the income generated from Extended Fares.
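The share of rides that a grace period would cover can be estimated directly from trip durations. The following is a minimal sketch on made-up durations; the 30-minute Base Fare limit and the 5-minute grace period come from the text above, while the data and variable names are hypothetical. The text does not say whether the 16% figure is a share of all rides or of Extended Fare rides, so both are computed.

```python
import pandas as pd

# Hypothetical trip durations in minutes (made up for illustration).
duration = pd.Series([6, 12, 28, 29, 31, 31, 33, 45, 58, 90])

BASE_LIMIT = 30  # minutes covered by the Base Fare
GRACE = 5        # proposed grace period in minutes

extended = duration > BASE_LIMIT
within_grace = extended & (duration <= BASE_LIMIT + GRACE)

share_of_all = within_grace.mean() * 100                          # % of all rides
share_of_extended = within_grace.sum() / extended.sum() * 100     # % of Extended Fare rides

print(f"{share_of_all:.1f}% of all rides fall inside the grace window")
print(f"{share_of_extended:.1f}% of Extended Fare rides fall inside the grace window")
```

Reporting both shares makes the revenue trade-off explicit: the first shows how many customers benefit, the second how much of the Extended Fare income is at stake.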

4.15 What is the average trip duration and the most frequent trip duration among the bike rentals? Explore the factors that influence the rental duration.

Insights:
Insight 1 It appears that most customers prefer rides with either Normal or Small trip durations and avoid trips with Long and Very Long durations.
Insight 2 The average trip duration of the overall dataset is 30 minutes. However, when the outliers are removed and the data is re-evaluated, the average trip duration drops to 18 minutes.
Insight 3 Also, for trips under 30 minutes, the average trip duration is 12 minutes.
Insight 4 The most frequent trip duration remains 6 minutes for all datasets limited by trip duration. This indicates that most customers rent the bikes for a quick short trip.
Insight 5 Customers tend to travel longer on One Way trips than on Round Trips.
Insight 6 The most frequent trip duration for One Way trips remained the same across the various datasets at 28 minutes. This means that most customers who prefer longer trips use the full extent of the Base Fare trip duration on One Way trips.
Insight 7 The most frequent trip duration for Round Trips remained the same across the various datasets at 5 minutes. This means that most customers prefer short trips when it comes to Round Trips.
Insight 8 Customers prefer Smart bikes over other bike types for trips with longer durations.
Insight 9 The difference between Smart bikes and other bikes increases with the duration limit of the dataset. For example, Smart bikes lead by a wider margin in average trip duration for trips under 120 minutes than for trips under 30 minutes.
Insight 10 Standard and Electric bikes have similar average trip durations. This indicates that customers treat these bike types almost the same when it comes to trip duration.
Insight 11 Bike rides on short-term subscriptions such as the One Day and Walk-up passes have higher average trip durations than rides on long-term subscriptions such as the Monthly and Annual passes, and the difference increases with the trip duration limit.
Insight 12 Bike trips with Extended fares have a heavy presence of outliers. When outliers are removed and the dataset is limited to 120 minutes, the average trip duration for Extended Fares is 57 minutes.
Insight 13 The most frequent trip duration for Extended Fares is 31 minutes. This indicates that many customers who intended to return the bike within the Base Fare period missed the 30-minute limit by a margin of 1 minute.
Reforms proposed:
Reform 1 Most customers prefer rides with either Normal or Small trip durations and avoid trips with Long and Very Long durations. Even if this behaviour is to be expected, encouraging customers to ride for longer durations will generate income and improve business metrics.
Reform 2 Organizing 2-hour bike rallies and other events will attract enthusiasts to ride for longer durations.
Reform 3 Announcing low fares for tourists will attract them to rent bikes for longer durations.
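The sensitivity of the mean to outliers, and the stability of the mode, can be reproduced with a small comparison across duration limits. This is only a sketch on made-up durations; the 120-minute and 30-minute limits come from the insights above, everything else is hypothetical.

```python
import pandas as pd

# Hypothetical trip durations in minutes, including one extreme outlier.
duration = pd.Series([5, 6, 6, 6, 10, 15, 28, 45, 100, 600])

limits = {"all trips": None, "under 120 min": 120, "under 30 min": 30}
rows = {}
for label, limit in limits.items():
    subset = duration if limit is None else duration[duration < limit]
    # mean reacts strongly to the outlier; mode (most frequent value) does not
    rows[label] = {"mean": subset.mean(), "mode": subset.mode().iloc[0]}

summary = pd.DataFrame(rows).T
print(summary.round(1))
```

In this toy sample a single 600-minute trip pulls the overall mean far above the typical ride, while the mode stays the same at every limit, which is the same pattern the insights describe.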

4.16 What is the average trip distance and the most frequent trip distance among the bike rentals? Explore the factors that influence the bike ride distance.

Insights:
Insight 1 Bike rentals are fairly well distributed across Small, Normal and Large trip distances. This represents a productive business model that covers all kinds of customer needs.
Insight 2 Only a small number of customers take Very Small and Very Long bike rides. This is also a good sign, as Very Small rides represent customer dissatisfaction, and a larger number of Very Long rides would scatter the available bikes across a geographical location.
Insight 3 A good number of customers preferred Round Trips.
Insight 4 One Way trips have an average trip distance of 0.8 miles, while the most frequent trip distance is 0.5 miles.
Insight 5 Customers preferred Smart bikes for longer trip distances compared to other bike types.
Insight 6 Bike rides taken on a Monthly pass have lower average bike rentals compared to other pass types.
Insight 7 Short-term passes such as One Day and Walk-up have a higher most frequent trip distance than long-term subscriptions. This is because long-term passes are chosen by working individuals for the ride to work, while short-term passes are preferred by tourists and explorers who take longer rides to explore the city.
Insight 8 Trips with Extended Fares have a higher most frequent trip distance than trips with Base Fares. This is to be expected, as trip distances and trip durations are correlated.
Reforms proposed:
Reform 1 Organizing long bike rallies and other events will attract enthusiasts to ride the bike for longer distances.

-------- End of 4. Investigation Summary --------


5. Credits:

================

5.1 Udacity Platform:

My sincere and deep gratitude to the Udacity platform for making this Data Analyst Nanodegree available.

5.2 Instructors:

Sebastian Thrun
: INSTRUCTOR
About As the founder and president of Udacity, Sebastian’s mission is to democratize education. He is also the founder of Google X, where he led projects including the Self-Driving Car, Google Glass, and more.
Derek Steer
: CEO AT MODE
About Derek is the CEO of Mode Analytics. He developed an analytical foundation at Facebook and Yammer and is passionate about sharing it with future analysts. He authored SQL School and is a mentor at Insight Data Science.
Mike Yi
: INSTRUCTOR
About Mike is a content developer with a multidisciplinary academic background, including math, statistics, physics, and psychology. Previously, he worked on Udacity's Data Analyst Nanodegree program as a support lead.
Josh Bernhard
: DATA SCIENTIST
About Josh has been sharing his passion for data for nearly a decade at all levels of university, and as Lead Data Science Instructor at Galvanize. He's used data science for work ranging from cancer research to process automation.
David Venturi
: INSTRUCTOR
About Formerly a chemical engineer and data analyst, David created a personalized data science master's program using online resources. He has studied hundreds of online courses and is excited to bring the best to Udacity students.
Sam Nelson
: PRODUCT LEAD
About Sam Nelson is the Product Lead for Udacity’s Data Analyst, Business Analyst, and Data Foundations programs. He’s worked as an analytics consultant on projects in several industries, and is passionate about helping others improve their data skills.
Juno Lee
: CURRICULUM LEAD
About Juno is the curriculum lead for the School of Data Science. She has been sharing her passion for data and teaching, building several courses at Udacity. As a data scientist, she built recommendation engines, computer vision and NLP models, and tools to analyze user behavior.
Mat Leonard
: CURRICULUM LEAD
About Mat, the curriculum lead, is a former physicist, neuroscientist, and data scientist with a passion for education. Recently, he led the Deep Learning Nanodegree Foundation program covering state-of-the-art machine learning models.

5.3 Contact:

Vamshi Krishna Prime: Data Analyst




|

-------- End of Act 3: Explanatory Data Analysis --------

|